

Measurement and Evaluation 


in Psychology and Education




by 
ROBERT L. THORNDIKE 


Professor of Education 
Teachers College, Columbia University 


and 


ELIZABETH HAGEN 


Instructor in Psychological Foundations 
Teachers College, Columbia University 


NEW YORK - JOHN WILEY & SONS, INC. 
LONDON - CHAPMAN & HALL, LIMITED 


COPYRIGHT, 1955
BY
JOHN WILEY & SONS, INC.


All Rights Reserved
This book or any part thereof must not
be reproduced in any form without
the written permission of the publisher.


PRINTED IN THE UNITED STATES OF AMERICA 


Preface


For a number of years, we have given a course at Teachers Col- 
lege, Columbia University, to classroom teachers, administrators, 
guidance workers, and major students in educational, developmen- 
tal, personnel, or clinical psychology. This book evolved out of 
the needs of that course. It undertakes to provide the foundations 
that these workers in different branches of education and psychology 
will need in order to use and interpret tests, move ahead into more 
specialized testing courses, and go ahead independently to study 
their own practical testing problems. 

A person who is to use tests must have a sound general background 
of concepts of reliability and validity, different types of norms for 
expressing test results, and the values and limitations of different 
ways of studying people. He also needs to have this general background
focused on the specific practical problems of selecting appropriate
tests for use, planning a suitable testing program for a
school or other setting, writing better test items, and reporting and 
interpreting test results. We have tried to provide both a founda- 
tion in basic concepts and guidance and illustrations of the applica- 
tion of these concepts to practical problems in schools, counseling, 
or personnel work. 

The general orientation of the book is practical. It is designed 
primarily for the person who is going to use and interpret tests, not 
for the person who is going to theorize about them. This shows up 
in the selection of material for inclusion and in the treatment of 
many topics. For example: 

1. We have emphasized the need to define the objectives of meas- 
urement. This has been emphasized in relation to teacher-made 
tests, in the discussion of test validity, and in the consideration of 
testing programs. We feel strongly that too much emphasis is often 
given to the mechanics of measurement, and too little to the ways
in which the results of testing or other evaluation procedures are
to be used. We have tried to restore objectives and purposes to a
central position in thinking about testing.




2. We have tried to restrict the presentations of computational
routines in statistics to those that the grassroots user of tests will
have need to compute. Other statistical material has been presented
with a view to developing understanding consumers, rather
than producers, of statistics.

3. We have provided a large number of specific illustrations of
poor test items, all relating to a single unit of instruction, and have
shown how each might be improved.

4. We have tried to develop a healthy respect for the error of
measurement in any appraisal procedure, and to encourage tentativeness
in test interpretation. The absolutism that has characterized
many judgments based on tests in the past seems to us quite
undesirable. We would like teachers, counselors, and psychologists
to use tests, but to use them circumspectly, realizing that they lead
to probable inferences rather than fixed conclusions.

5. Our treatments of teacher-made tests, school testing programs, 
and reporting and marking systems are planned to give help with 
immediate practical problems that face the teacher and school ad- 
ministrator, at whatever level. However, while approaching these 
topics from a practical standpoint, we have tried to base our dis- 
cussion upon sound analysis of the underlying logical and method- 
ological issues, and to avoid the superficial treatment that has some- 
times been given to these topics. 

6. The discussion of each type of measurement technique includes
a critical appraisal of the strengths and weaknesses of that tech- 
nique. We try to indicate the ways in which the instruments can 
effectively be used and to point out uses that should be avoided. 

7. Knowing that it will be impossible to describe and evaluate
tests at all levels and in all special fields, we have made a point of
acquainting the student with sources through which he can learn
more about tests of any particular type. Chapter 8 is devoted to
educating the student in how to find out for himself anything further
that he wants to know about a particular test, tests of a particular
sort, or some special testing problem of interest to him.


The book has been tried in a preliminary photo-offset version, and
each chapter has profited from detailed and thoughtful criticism by
our students. Many improvements have been made, but others are
undoubtedly possible. We will appreciate any suggestions that users
of the book may have for improving it in subsequent editions.


ROBERT L. THORNDIKE
ELIZABETH HAGEN

New York, N.Y.
April, 1955


Contents

CHAPTER

1. Historical and Philosophical Orientation
2. Overview of Measurement Methods
3. The Teacher's Own Tests
4. Preparing Objective Tests
5. Elementary Statistical Concepts
6. Qualities Desired in Any Measurement Procedure
7. Norms and Units for Measurement
8. Where to Find Information about Specific Tests
9. Standardized Tests of Intelligence or Scholastic Aptitude
10. The Measurement of Special Aptitudes
11. Achievement Tests
12. Behavioral Measures of Personality
13. The Individual as Others See Him
14. Questionnaires and Inventories for Self-Appraisal
15. Projective Tests
16. Planning a School Testing Program
17. Marking and Reporting
18. Measurement in Educational and Vocational Guidance
19. Tests in the Selection of Personnel
20. Measurement in Diagnosis and Therapy

APPENDIX

1. Computation of Square Root
2. Calculating the Correlation Coefficient
3. Section A  General Intelligence Tests
   Section B  Aptitude Test Batteries
   Section C  Reading Tests
   Section D  Elementary-School Achievement Batteries
   Section E  High-School Achievement Batteries
   Section F  Interest Inventories
   Section G  Adjustment and Temperament Inventories
4. Sources for Educational and Psychological Tests

Index


Chapter 1 


Historical and Philosophical
Orientation 


HISTORICAL BACKGROUND 


The roots of the measurement of man lie in antiquity. We must
believe that even in prehistoric times Og, the cave man, made rudi- 
mentary appraisals of his fellows. He saw Zog go by, made some 
such judgment as "Big, strong, keep out of way," and acted upon 
it; or he came upon the campfire of Wog, observed “Small, weak, 
take dinner," and did so forthwith. But for much of recorded his- 
tory, the appraisals that man has made of his fellows have been of 
this crude subjective type. 

He who seeks imaginatively can find suggestions of more syste- 
matic and refined methods. Thus, the tournaments of the days of 
chivalry can be thought of as an effort to arrange men in an order 
from best to worst in feats of arms, and contests leading to the 
crowning of "champions" have always constituted a rough sort of 
measurement. Teachers have always catechized their pupils to ap- 
praise their degree of mastery of the tasks assigned them, evaluating 
them as best they could by their responses. But these approaches 
were more primitive than the sun dial and the ox cart. They are 
characteristic of the appraisal of man and his behavior up to the 
present century. Application of the quantitative methods of science 
to psychology and education is very new. In 1850 there was almost 
none of it; 1900 was still a pioneering period. 


EARLY EDUCATIONAL TESTING 


The appraisal of educational achievement in the United States 
before 1850 had relied very largely upon oral examination. The 
teacher or visiting examiner asked a question. The designated 
pupil undertook to answer it. The questioner arrived at an immediate
subjective evaluation of the answer. There was uniformity
neither in the questions asked different pupils nor in the evaluation
of their replies. The method was burdensome and inefficient, since
only one pupil could be tested at a time. It provided no comparabil- 
ity from pupil to pupil either in the task or in the evaluation of it. 

During the latter half of the nineteenth century, oral examinations
by boards of visitors were replaced by set written examinations as
a basis for promotion or admission to an academy or college. Outside
examination in turn yielded to evaluation by the classroom
teacher. Whether carried out by an outside examiner or by a teacher,
however, the technique was that of the essay examination, in which
a pupil responded in his own words to a question set by the examiner.

The written examination had the advantages of (1) presenting
the same tasks to each member of the group and (2) letting each
pupil work for the full examination period. However, though the
task was made uniform, at least for the members of a given class,
appraisal of each individual's response to the task remained highly
subjective, depending upon the standards and prejudices of the
particular scorer. As we shall see in Chapter 3, great variations
were found in the scoring of a particular paper. Only since 1900
has there been any general development of objectively scored tests
in which a pre-established key can be routinely and uniformly applied
to the responses made by each pupil. Only since 1900 has the
idea emerged of a general standard of performance for an age or
grade, with which the performance by any class or any individual
may be compared.


THE BEGINNINGS OF PSYCHOLOGICAL MEASUREMENT

Psychology in 1850 was still in large measure a part of philosophy.
Courses dealing with man and his actions were presented under the
title “Moral Philosophy,” and discussed in an armchair fashion the
nature of the Mind and the Soul. Psychology was almost entirely
non-experimental, and the idea that one could measure in quantitative
terms the speed of responding, the amount of forgetting, or
the level of intelligence would have been received in most quarters
with hostility or, more probably, ignored as not worthy of rebuttal.
The nearest approaches to psychological measurement were a few
scattered experiments by physicists and physiologists on the measurement
of the ability to make sensory discriminations and the
speed of simple elementary responses.

By 1900 psychology had felt the impact of the physical and biological
sciences and was striving mightily to become a science itself.
It was shaking off the ties that bound it to philosophy and forming
new alliances with the biological sciences. It had adopted the experimental
method and was measurement-conscious. The basic
tool of experimentation is measurement, and psychology was expanding
its measurement techniques in all directions. The record
since 1900 is the record of the attempt to expand and adapt measurement
techniques to cover all aspects of human behavior.

Three main streams combined to yield the vigorous measurement 
movement in psychology and its spread through education. Some 
of the flavor and some of the emphasis have come from each stream. 
These were (1) the physiological and experimental psychology that 
had its main growth in Germany in the nineteenth century, (2) 
Darwinian biology, and (3) the clinical concern for the maladjusted 
and underdeveloped individual. 


BEGINNINGS OF EXPERIMENTAL PSYCHOLOGY 

The modern scientific era was first ushered into the physical
sciences in the seventeenth and eighteenth centuries. Scientific interest
and method soon spread over to the biological sciences, and
by the early nineteenth century experimental physiology was a center
of active research interest in the experimental laboratories in
Germany and other European countries. Experimental physiologists
became interested in the operation of the senses, studying intensively
seeing, hearing, and the other senses. Physiologists also
became interested in measuring the speed of simple motor responses.

In 1879 the first laboratory for experimental psychology was 
established by Wilhelm Wundt at Leipzig. Early experimental 
psychologists were interested in many of the same measurements 
that had concerned the physiologists. These were measures of 
seeing, hearing, feeling, and speed of response. But gradually they
extended their concern to more clearly psychological matters, such 
as measurement of perceptual span—the amount that the individual 
can "take in" at once, of rate of learning, of the timing of complex 
mental tasks, and so forth. 

One area of particular interest for its contribution to the broad 
field of psychological and educational measurement was that known
as psychophysics. The experimental psychologist was much interested
in exploring the relationship between physical stimulus intensities,
e.g., of light wave or of sound wave, and the experienced
intensity of the resulting sensation. The designing of effective 
experimental procedures for studying these problems gave rise to a 
set of techniques that have proved adaptable to a wide range of 
problems of psychological measurement. 

From experimental psychology came a legacy of respect for careful
experimental method and precision of technique, a number of
experimental designs, and statistical techniques that could be car- 
ried over to more general psychological and educational measure- 
ment problems. 


EARLY STUDY OF INDIVIDUAL DIFFERENCES 


A second stream contributing to psychological measurement was 
Darwinian biology. In 1859 Darwin brought out his Origin of Spe- 
cies. The basic concern in Darwin's work was with variation among 
the members of a species, that is, individual differences. Darwin's 
work was followed up in England and applied to distinctively human 
affairs, particularly by Sir Francis Galton. Whereas German psy- 
chology had focused on finding the general facts true of all people, 
Galton became interested primarily in the differences among people. 
Stimulated by Darwin to study the inheritance of traits, he gath- 
ered data both on physical and on psychological characteristics. 
The study of these individual differences required better statistical 
tools, and the British group, under the leadership of Karl Pearson, 
developed improved techniques for analyzing and describing the pat- 
terns of individual differences. 

These, then, were the two main contributions of the British group 
to the growth of psychological measurement: a deep concern for 
studying the differences among people as interesting and significant 
facts, and appropriate statistical techniques and tools for carrying
out this study. 


CLINICAL STUDY OF DEVIATES 


During this same period, a third stream was gathering strength. 
This was concern for the individual who was not functioning suc- 
cessfully. Humanitarian concern for the insane, the feeble-minded, 
and the general misfit led in the nineteenth century to active re- 
search and investigation aimed toward understanding their condi- 
tion and improving their lot. This clinical interest in the malad- 
justed individual was particularly strong in France, and it was here 
that it bore fruit for the field of measurement. As psychologists 
worked with these unfortunate deviates, the need became more 
and more apparent for some uniform way of expressing the degree 
of their defect, particularly in the mental sphere. It was in this 
context of concern for the child who was not getting along in school 
that Binet and his colleagues developed the series of intellectual
tasks that ultimately grew into the whole array of measures of
intelligence.




SYNTHESIS IN THE UNITED STATES 


By the early years of the present century, all these streams of
influence had made themselves felt in the United States. James 
McKeen Cattell had taken his graduate work in psychology in 
Germany with Wundt, where he had received a good grounding 
in quantitative and experimental psychology. But he had also been 
exposed to the work of Galton and had developed a lasting interest 
in individual differences and statistical method. When he returned 
to the United States, he began an investigation of individual differ- 
ences in the simple sensory and motor performances that were being 
measured in German psychological laboratories. He studied the 
relationship between these performances and academic success. 

E. L. Thorndike was a student of Cattell's just before the turn 
of the century and became a focal influence in the spread and de- 
velopment of standardized educational tests. Both his own work 
and that of a large group of students rapidly spread the gospel of 
objective measurement in education. 

The work of Binet was eagerly seized upon in this country. His 
tests were translated and produced in several versions, of which by
far the most influential became the Stanford-Binet produced by Lewis 
Terman in 1916. The testing movement seemed especially suited 
to the temper of this country and took hold here with a vigor and 
enthusiasm unequaled elsewhere. 


MEASUREMENT IN THE TWENTIETH CENTURY 

The twentieth century may be divided up into three more or less 
distinct stages as far as psychological and educational measure- 
ment are concerned. The period from about 1900 to 1915 we may 
designate as the pioneering phase. This was the period of exploration,
of initial development of methods. It saw the emergence of
the first Binet intelligence scales and their early American revisions. 
Standardized achievement tests in different subjects began to ap- 
pear, exemplified by Stone's arithmetic tests, Buckingham's spelling 
tests, and Trabue's language tests. Thorndike developed his first 
handwriting scale. Otis and others were initiating work on group 
tests of intelligence. 

The next 15 years, 1915 to 1930, can perhaps be called the "boom" 
period in test development. The pioneers had shown the way, and 
in the hands of enthusiastic followers tests multiplied like rabbits. 
Standardized tests were developed for all the school skills and for 
the content areas of the school program. Achievement batteries 
made their appearance. Starting with Army Alpha of World War
I, group intelligence tests were produced in great numbers. Also
starting with a wartime product, the Woodworth Personal Data Sheet, 
a whole line of personality questionnaires and inventories came into 
being. 

The rapid development of testing instruments and methods was
pushed by a group of enthusiasts. They were converts who had
"gotten the word." Their enthusiasm was contagious and extended
not only to the production of tests but also to their use. Tests of
intelligence and achievement were administered widely and somewhat
indiscriminately. Test results were often accepted unhesitatingly
and uncritically and served as the basis for a variety of
unjustified judgments and actions with respect to individuals. In
the expansive flood of enthusiasm for objective measurement, the
enthusiasts were not too inclined to be critical of their instruments
or the interpretation of results from them. Many sins were committed
in the name of measurement by uncritical test users.

After a while the pendulum began to swing back. More and
more sharply voiced criticisms of tests and of the uses made of tests
began to be heard. Heredity-environment discussions became
acrimonious. The use of test scores as a basis for classroom grouping
became the subject of bitter attack. Criticism was directed at
specific tests in terms of their limited scope and their emphasis upon
restricted and traditional objectives. It was also directed at the
whole underlying philosophy of quantification and the use of numbers
to express psychological qualities.

The critical attack had the healthy effect of forcing the test enthusiasts
themselves to become more critical of their assumptions
and procedures and to broaden their approach to the whole problem
of psychological and educational appraisal. The period from 1930
on has, therefore, been one of critical evaluation, of taking stock,
of broadening techniques and delimiting interpretations, of integrating
the lessons of the half century into a balanced and reasonable
approach to the appraisal of human behavior. Let us try at this
point to formulate a philosophy of measurement that will take account
of these lessons and serve to guide our attack on measurement
problems and our use of measurement techniques in the years ahead.


PHILOSOPHICAL ORIENTATION 

In education and in psychology we are concerned with human
beings. Sometimes we are concerned with them as specific individuals,
as when we want to know why Mary is having so much difficulty
in learning to do long division. Sometimes we are concerned
with them as specific groups of individuals, as when we inquire
whether the children in class A can read as well as those in class B.
Sometimes we are concerned with them as general representatives
of mankind, as when we try to determine whether children with high
verbal intelligence tend to show more or less signs of emotional disturbance
than children of average intellectual ability.


KNOWLEDGE AS A GUIDE TO ACTION 


In practically all of education and in much of psychology, our
concern about individuals is to do something about them, individually
or collectively. In so far as it is a science, education is an applied
science, and the applied aspects of psychology bulk large in the
present scene. The educator or the practical psychologist is continually
faced with the necessity of arriving at some decision as to
a course of action. He must decide what to do about an individual
or individuals, or he must help the person in question decide upon
a course of action. He must decide in which grade to place the child
or what special instruction to provide for him. He must reach a
diagnosis of the child with a reading disability, with a view to recommending
treatment. He must recommend whether or not to employ
the job applicant. He must help the student decide whether to go
to college and, if so, what sort of program to take and what type
of job to aim for. The educator or psychologist wants each one of
these decisions to be a sound and well-conceived one.

Our basic assumption is that sound decisions arise out of relevant 
knowledge of the individual or individuals. We assume that the 
more we know about a person that relates to our present decision, 
and the more accurately we know it, the more likely we are to ar- 
rive at a sound decision about him or a wise plan of action for him. 
By the same token, we assume that the more relevant and accurate 
information we can provide the individual about himself, the more
likely he is to arrive at a sound decision on his own problem. We
may need to qualify this assumption as we proceed. There may be
limits on the amount and kind of information that can be used in
a particular situation. We shall indicate that knowledge in and of
itself is not wisdom. But in its general form the assumption is basic
not only to educational and psychological measurement but also to
all science. We assume basically that knowledge is good, that
knowledge is power, that knowledge is the basis for effective control
of the problems that confront us from day to day. This is a basic
tenet of our faith.




What does it mean to "know" an individual? Fundamentally,
to know an individual means to be able to describe him accurately 
and fully. If we know Mary Jones well, we can describe not only 
how she looks—how tall she is and how heavy, the color of her hair 
and eyes, the birthmark under her chin. Much more importantly, 
we can describe what she can and will do—how she will dress, what 
she is likely to talk about, what she will be interested in, what types 
of tasks she can do and how well she can do them, how she will respond 
to stresses and strains of one sort and another. To know a person 
completely means to be able to describe him completely, to predict 
how he will behave in every possible situation. Obviously, we are 
far, far away from this objective, and we always will be. It repre- 
sents the star to which we hitch our wagon. 


IMPORTANCE OF MEASURING THE RIGHT THING 


The effectiveness of our description of any object or person de- 
pends upon two things. It depends upon how wisely we have chosen 
the features to be described and upon how truly and accurately we 
have managed to describe each one. A description may fail to be 
useful for the need at hand because we choose irrelevant features 
to describe. Thus, in describing a painting we might report its 
height, its breadth, and its weight. We might report these with 
great precision. If our concern were to crate the picture for ship- 
ment, these might be just the items of information we would need. 
On the other hand, if our purpose was that of characterizing the 
painting as a work of art, our description would be worthless. The 
attributes of the picture we had described would be essentially
irrelevant to its quality as a work of art.

Similarly, a description of a person may be of little value for our
purposes if we choose the wrong things to describe. Thus, the Air
Force in selecting pilots to fly jet fighters might get very accurate
information on height, weight, years of education, size of vocabulary,
and speed of reading for all its applicants. It would almost surely
find, however, that none of the things that had been described helped
at all in selecting the men who could successfully learn to fly the
planes. Such factors as these are in large measure irrelevant to flying
success.

Again, a high school concerned about assessing the literary appreciation
of its pupils might prepare a test inquiring exhaustively into
the names of the characters and the details of the plot of Shakespeare's
Julius Caesar. The worthlessness of this procedure may be
less obvious, but is probably just as real as that proposed for the
selection of pilots. This test seems useless for the task at hand,
because detailed factual knowledge about an isolated literary work 
is no indicator of the quality of literary appreciation. The test has 
asked the wrong kinds of questions. The evidence it provides is 
related to a faulty interpretation of the question that was asked. 
The first, and perhaps the most important, step in any educational 
or psychological measurement project is defining just what it is that 
we wish to measure and determining what operations will serve to 
measure it. Educational objectives are likely to be incompletely 
formulated and expressed in vague terms. The concepts must be 
clarified and made more specific before we can make much progress 
towards sensible procedures of measurement. Until we can decide 
what is meant by “good citizenship" or what behaviors are exhibited 
by a person who shows "understanding of scientific method," we 
have little prospect of developing procedures to appraise either the
one or the other.


THE NEED FOR PRECISION 

Our description may be of limited value in the second place be- 
cause the attributes we elect to describe are described inaccurately. 
Thus, if our description of the painting were expressed in terms of
theme, composition, line and volume, and color values, it would
certainly be a good deal more to the point as an appraisal of a work
of art. But it would be much more wordy, more subjective, and
less precise than our previous description of length, breadth, and 
weight. Different persons could be expected to differ markedly in 
the qualities they saw and the terms they used to describe them. 
This might be true to such an extent that a single individual's 
description would give us only a very rough, unclear, and unde- 
pendable impression of the picture as a work of art. 

As for the candidate for pilot training, we might get ratings from 
his friends on his speed of learning new coordinations, ability to pay
attention to many things at once, and resistance to disturbance by
emotional stress. We may hazard a guess that these ratings would
again prove ineffective in predicting pilot success—not so much be- 
cause the qualities themselves are unimportant, but because we are 
not skillful in observing such qualities in our fellows or in expressing 
our observations in exact quantitative form. 

Our high school, concerned with literary appreciation, might ask 
each pupil to write a report on some book he had read recently, 
telling what he had liked about it, and why he had or had not thought
it was a good book. Again, we may feel that such a report would
provide information more related to appreciation than would a test
of factual knowledge. But judging quality of appreciation shown
in a varied collection of compositions about an assortment of different
books would be a very subjective enterprise, and the judgments
would tend to be quite undependable. Each judge would have his 
own personal standards of what constituted good literary apprecia- 
tion. He would make his judgments in terms of those personal 
standards. There would be little agreement from one judge to the 
next as to who had shown good appreciation and who had shown 
poor. Our appraisal would be unsatisfactory because it would be 
inaccurate. 


DEGREES OF REFINEMENT IN MEASURING 


There is enormous variation from one trait to another in the degree 
of refinement we have been able to achieve in describing it. At 
the crudest level, our appraisal may come to no more than a simple 
two-way classification. This may take the form present-absent,
e.g., John lisps but Bill does not lisp; or the form trait-opposite,
e.g., John runs fast but Bill runs slowly. 

A somewhat more refined level of description is achieved when we 
characterize the trait by a set of adjectives which represent degrees 
of the trait: John runs fast, Joe goes like a streak, Jack runs 
fairly fast, Will goes like molasses in January. But the number of
such qualitative descriptions is limited, and the meaning of such 
adjectives or similes is far from uniform from person to person. 

A still further level of refinement in description is reached when 
we can arrange the members of a group in rank order with respect 
to an attribute and when we can locate any individual on such a 
rank order. Thus, we may say Joe runs the fastest, John runs 
faster than Jack, Jack runs faster than Will, and Will runs the
slowest. Such a procedure of ranking could theoretically be extended
to include all the children in a class, in a school, or even all
the children in the whole country. Clearly, for those traits that we
can appraise well enough to produce such a ranking, a very great
increase in the adequacy of our description has been achieved.

Finally, some attributes may be expressed in a quantitative statement
of amount. Thus, we may be able to report that Joe ran 100
yards in 10 seconds, John in 14 seconds, Jack in 15 seconds, and so
forth. This last is clearly the most precise type of statement of the
essential facts and the one that makes us best able to decide upon
appropriate action with regard to an individual, so far as that action
depends upon speed of running 100 yards. It is certainly the type
that the track coach would want to have before deciding whom to
keep on the track team.

We have identified four points along a scale of quantification and 
precision of measurement. 


1. Either-Or. A pupil is either a boy or a girl. A man is either
single, married, widowed, or divorced. A student is enrolled in the
college preparatory, commercial, or general curriculum.

2. Qualitatively Described Degrees. Thus, a pupil may show "normal
speech," "slight stuttering," "stammer," "marked stutter."
Or the pupils in a class may be characterized as "quiet and relaxed,"
"slightly fidgety," or "tense and restless."

3. Rank in a Group. Thus, a series of graded tasks scored by
uniform standards enables us to find who does best and who does
worst on reading comprehension, arithmetic problems, or spelling.
The rest of the group can be arranged in order from best to worst.

4. Amount, Expressed in Uniform Established Units. A boy
weighs 56 pounds, is 45 inches tall, is 6½ years old.
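
The relation among these four levels can be illustrated with a short sketch in Python. It starts from the 100-yard times given above for Joe, John, and Jack and derives the three cruder descriptions from them. The cutoff values and the adjective labels are hypothetical, chosen only for illustration; they are not taken from any established standard. The point of the sketch is that a statement of amount can always be reduced to a rank, a described degree, or an either-or class, whereas the reverse step is not possible.

```python
# Times for the 100-yard run, in seconds, as given in the text.
times = {"Joe": 10, "John": 14, "Jack": 15}

# Hypothetical cutoffs, assumed here only for illustration.
FAST_CUTOFF = 12                                   # either-or: fast vs. slow
DEGREE_LABELS = [(11, "very fast"), (14, "fairly fast")]  # (upper bound, label)


def either_or(seconds):
    """Level 1: a simple two-way classification."""
    return "fast" if seconds <= FAST_CUTOFF else "slow"


def degree(seconds):
    """Level 2: a qualitatively described degree."""
    for upper_bound, label in DEGREE_LABELS:
        if seconds <= upper_bound:
            return label
    return "slow"


def rank_order(times_by_name):
    """Level 3: rank in the group, 1 = fastest (smallest time)."""
    ordered = sorted(times_by_name, key=times_by_name.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


ranks = rank_order(times)

# Level 4 is the time itself; the other columns are derived from it.
for name, seconds in times.items():
    print(name, seconds, ranks[name], degree(seconds), either_or(seconds))
```

Note that the rank column would be unchanged if Joe had run in 13 seconds rather than 10; only the statement of amount preserves how far apart the runners are.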


This wide variation in the refinement of our appraisals must be
frankly admitted. Some traits we may never be able to express
more accurately than by a "very" and "not very" characterization.
Our failure to have achieved greater refinement in measuring these
traits is probably partly due to a lack of clarity and sharpness in
definition of the attribute that we propose to describe. When we
characterize a person as sincere, cultured, socially adjusted,
cooperative, a good citizen, our hearer may have only a very general idea of
what we mean. (And, as a matter of fact, so may we.) In part our
failure is certainly due to the limited ingenuity and skill we have
shown to date in finding ways to represent degree or amount of the
attribute with precision. It may sometimes be partly due to the
essential nature of a particular attribute, which makes it
fundamentally not expressible in quantitative form. There may be some
things that, in their very nature, can never be quantified.

Certainly, our present ability truly to measure many of the
attributes of persons that appear to be relevant and important for
making decisions about them and planning actions with respect to
them leaves much to be desired. However, while recognizing this
fact we must also appreciate that enormous strides have been made
since 1900 toward more objective and more accurate appraisals of
human beings. The fact that we are limited in some directions does
not lessen the value of increased precision wherever such increased
precision has been achieved. While keeping a critical eye upon the
limitations of measurement procedures, we should still use them for
all they are worth in increasing the accuracy of our information
about students, employees, or clients.


CRITICISMS OF PSYCHOLOGICAL AND EDUCATIONAL MEASUREMENT 


Since about 1930, psychological and, particularly, educational
measurement have come in for a good deal of criticism. The
educational philosophers have been especially outspoken in expressing
their dissatisfaction. In part, the criticisms have been directed
at the basic logic of psychological measurement. These criticisms
have been directed at the limitations we have just been discussing,
as well as at some other problems concerning the equivalence of
units and scores, which we shall consider briefly in a later chapter.


In part, however, the criticisms have been directed at the effects 
that the measurement procedures have had upon school practice. 
The following types of criticisms have been made: 


1. Standardized measurement procedures have been said to
foster undemocratic practices and attitudes in the classroom. Forming
homogeneous class groups on the basis of an intelligence or
achievement test is a specific practice that has been the target of this
criticism.

2. It has been contended that standardized tests have had the
effect of freezing the curriculum and of preventing experiment and
change, on the grounds that the commercial standardized test
typically lagged behind the advance of educational thought and
practice.

3. The limited scope of many standardized tests has been pointed
out, and it has been indicated that they fail to appraise many of
the changes in children that schools should be interested in producing.

4. The short-answer test items have been accused of producing
undesirable study habits directed toward piecemeal memorization
rather than understanding.


There has been at least a germ of truth in all these criticisms.
Some of them we shall consider in more detail in later chapters. As
the criticisms are examined, however, we find that they are not
primarily criticisms of obtaining more information about the individual.
They are criticisms either of (a) the incompleteness and imperfection
of the information yielded by our measurement procedures or (b)
the unwise things that we do with that information. It is as though
we condemned the physicist because (a) he cannot yet control the
weather and (b) his knowledge has led to the construction of atom
bombs which may destroy mankind. We must grant that our
measurement procedures are incomplete and our actions based on them
are sometimes unwise. But the remedy lies in developing better
measurement procedures that will give us more complete and more
accurate information about the individual. It lies in gaining better
understanding of our measures, their strengths and their
weaknesses, so that we may use them with more wisdom. It does not
lie in getting less information.

It cannot be too much emphasized that measurement at best 
provides only information, not judgment. A test will yield only a 
score, not the conclusion to be drawn from that score. The informa- 
tion provided in a test score is not a substitute for insight. This 
information is the raw material with which insight must work, in 
the clinic, in the classroom, and in the research laboratory.
Experience, training, and basic sagacity must provide the insight that
will take a set of data about an individual or group, know how much 
faith to place in them and what meaning to give them, and draw from 
them a sound conclusion or plan for action. 

Furthermore, it should be emphasized that the information that 
any measurement procedure gives is limited. It is limited by the 
nature of the measurement instrument itself. The typical intelli- 
gence test, for example, samples certain types of performances with 
abstract ideas expressed in symbolic form. It is not a measure of 
the general worth of the individual, of his ability to acquire mechani- 
cal skills or artistic techniques, or of his integrity and dependability 
as a member of society. The information is limited by the condi- 
tions under which the procedure is applied. Thus, an intelligence 
test given to an emotionally disturbed and resistant child may give 
a very inadequate picture of what that same child could do if the 
disturbing influences were removed and the resistance overcome. 


Learning to use measurement results wisely is in part learning what
information a particular device does and does not provide and in
part learning under what circumstances that information is likely to
be trustworthy. Throughout this book there will be recurring
attempts to guide that learning.


SUMMARY STATEMENT 
We can summarize much of the foregoing discussion on a working 
philosophy of measurement in the following four points. 


1. The process of measurement is secondary to that of defining
objectives. The ends to be achieved must first be formulated. Then
measurement procedures can be sought as tools for appraising the
extent to which those ends have been achieved.

2. Much of educational and psychological measurement is, and
will probably remain, at a relatively low level of precision. We must 
recognize this fact, using the best procedures available to us, but 
always treating the resulting score as a tentative hypothesis rather 
than as an established conclusion. 

3. The more elegant procedures of formal test and measurement
must be supplemented by the cruder procedures of informal observa- 
tion, anecdotal description, and rating if we are to obtain a descrip- 
tion of the individual that is usefully complete and comprehensive. 

4. No amount of ingenuity in developing improved procedures
for measuring and appraising the individual will ever eliminate the
need to interpret the results from those procedures. Measurement
procedures are only tools. Insight and skill are required in the use
of such tools. The sharper and more varied the tools, the more
skill it takes to use them most effectively.

SUGGESTED ADDITIONAL READING 

Anastasi, Anne, Psychological testing, New York, Macmillan, 1954,
Chapter 1.

Anastasi, Anne, and John P. Foley, Jr., Differential psychology: individual
and group differences in behavior, rev. ed., New York, Macmillan, 1940,
Chapter 1.

Boring, E. G., A history of experimental psychology, rev. ed., New York,
Appleton-Century-Crofts, 1950.

Freeman, Frank N., Mental tests: their history, principles and applications,
rev. ed., Boston, Houghton Mifflin, 1939, Chapters 1-2.

Lorge, Irving, The fundamental nature of measurement, Chapter 14 in
E. F. Lindquist, Editor, Educational measurement, Washington, D. C.,
American Council on Education, 1951.

Monroe, Walter S., Editor, Encyclopedia of educational research, rev. ed.,
New York, Macmillan, 1950, pp. 407-412; 1461-1463.

Murphy, G., An historical introduction to modern psychology, rev. ed., New
York, Harcourt, Brace, 1949, Chapters 6, 8, 11, 28, and 26.


QUESTIONS FOR DISCUSSION 


1. The development of objective and standardized tests has proceeded faster
and further in the United States than in any other country. What factors do
you see as contributing to this?

2. Talk to a student from some foreign country and find out what
examinations are like and how they are used in his country. What differences do you
find, as compared with the United States? What are the advantages and
disadvantages of each system?



3. In many graduate schools oral examinations are still used in examining 
candidates for higher degrees. What are the advantages and disadvantages 
of this type of examination? 

4. From your reading or from your personal experience, give one or more 
concrete examples of the misuse or misinterpretation of the results from
standardized tests.

5. How universally acceptable is the statement "knowledge is good" in the
field of education and applied psychology? What objections would you have
to this statement, or what limitations would you place upon it?

6. Give an illustration of a measuring procedure in education or psychology
that would be of little or no value because it was not sufficiently precise; one
that would be of no value because it was measuring the wrong thing.

7. Give two examples of educational or psychological measures to represent
each of the following four points along the scale of quantification and precision
of measurement: (a) either-or, (b) qualitatively described degrees, (c) rank
in a group, (d) amount, expressed in uniform, established units.


Chapter 2 


Overview of Measurement 
Methods 


During the present century techniques for appraising the indi- 
vidual have been developed in great variety, and they have been 
applied to many aspects of his abilities and personality. Specific 
techniques will be discussed in detail in later chapters. The present 
chapter is devoted to a general overview, mapping out some of the 
main landmarks of the whole domain. 


APPRAISAL BY TESTS VERSUS APPRAISAL BY 
OBSERVATION IN NATURAL SITUATIONS 

Attempts to appraise and describe a person can be grouped into
two main categories: those that depend upon setting up special
test situations and those that depend upon observing behavior in
the actual naturally occurring situations of life. The usual earmarks
of a test are that (1) it occurs at a specified time and place, (2) it
consists of a set of tasks uniform for each person tested, and (3) it
is seen as a test situation by the person being appraised. By
contrast, evaluation based upon the naturally occurring situations of
life is likely to (1) extend over an indefinite period, (2) be based upon
situations that vary from person to person, and (3) not be perceived
as a test by the person being appraised. The distinction between
test situations and natural life situations is not an entirely sharp
and clear-cut one, and we will have occasion to consider some
in-between cases. However, it is usually clear whether we are dealing
with a test as such or with observations under the natural
conditions of life.


In thinking about the evaluation and measurement of man, we
are likely to think primarily of tests narrowly defined, a test of
arithmetic, a test of scholastic aptitude, or a test of auditory acuity.
But we must remember that many of the important appraisals we
make of people have always been, and will continue to be, based on
observations of them as they live from day to day. Appraisals of
the nursery-school child's insecurity in relation to other children, of
the 10-year-old's cooperativeness, or of the junior executive's
initiative will almost necessarily be based upon observations of him over
a period of time as he functions in his natural social group.
Evaluations based on these observations have serious limitations. We are
likely to find little uniformity from person to person in either the
situations observed or the standards of judgment of the observers.
But for some kinds of behavior we have no adequate tests to
substitute for observations of natural situations, and very likely never will
have.

Any complete picture of evaluation procedures must, therefore, 
pay attention both to test techniques and to devices for improving 
the observation of naturally occurring behavior. We will tend to 
prefer test situations where suitable ones can be devised. We have 
more control over the situation, since we can present the same tasks 
or questions to everyone in the same way. We can usually get more 
precise results from a test and results that depend less upon the par- 
ticular person making the appraisal. However, we must recognize 
that many significant aspects of individual behavior, by their very 
nature, defy reduction to a neat test. These can be appraised validly 
only as the individual functions in a natural life situation. 

Of course, not all tests are perfectly frank and aboveboard. We
shall have occasion to consider various types of test instruments in
which the characteristics appraised are not those that the test seems
to be getting at. Outstanding in this group are the so-called
projective tests discussed in Chapter 15. What purports to be a test of
"imagination" may in fact be directed at revealing anxieties,
tensions, and inner emotional conflicts. Or a test of arithmetic
computation may be rigged to yield a measure of cheating. But these are
exceptions to the general rule that in a test the person knows that
he is being tested and knows what is being tested.


TWO FORMS OF TESTS 

Within a defined test setting we may again recognize an important
distinction, which depends upon whether the examinee leaves a
permanent record of his behavior or whether it must be observed
"on the wing" as it takes place. The first situation is represented by
any test, such as one of reading comprehension, in which the
examinee marks his answers on a paper. The marks are then permanently
recorded and can be scored at leisure. The second type of test would
be encountered in an appraisal of oral reading, for example, where
errors are noted by the listener as they occur or the quality is judged
by the listener as the reader speaks.

In this comparison, again, the advantages with respect to
reliability and objectivity usually fall on the side of the test that gives a
permanent record, the test with answers on an answer sheet or a 
definite product produced. It is hard to observe and record be- 
havior accurately as it is taking place. Inaccuracies and biases tend 
to creep in. The observer is hurried; his attention lapses. Conse- 
quently, in developing testing devices the tendency has been to make 
them of the sort that leaves a permanent record. 

But young children cannot read or write, and many others are 
handicapped in a test that requires them to do so. Again, some 
types of performances, such as speaking or singing, are not readily 
reduced to a usable permanent record. It is also true that sometimes 
we are interested not merely in what a person does but also in how 
he does it. If a child gets the right answer for 6 × 7, does he get it
quickly or slowly? Surely or with fumbling? By automatic
habituation of the correct answer or by counting up from 6 × 6? The
process does not show in the written answer but can sometimes be 
observed if the child answers the problem orally or "thinks out 
loud." 

There are test situations, therefore, in which we shall have to 
depend upon observations of the behavior as it takes place rather 
than upon scoring the written record. These test situations pose
special problems. Observers must be taught what to look for. They 
must be taught what responses to record and how to record them. 
They must be trained in standards of judgment, so that the pronun- 
ciation that they accept as right, for example, will also be one ac- 
cepted as right by other observers. It is for this type of test that 
special training of examiners is usually required.


EXTERNAL OBSERVERS VERSUS SELF-OBSERVATION 


As we move out of a test setting into observation of the
individual's behavior in the natural situations of life, two distinct options
are again open to us. We may rely upon an outsider to observe
the subject's behavior, someone such as his teacher, his employer,
a friend, or a member of his family. Or we can ask him to report
on his own characteristics as he sees them. These provide two quite
different views of the individual, the view from the outside and the
view from the inside.


The outside view is filtered through the biases and limited con- 
tacts of a particular outsider. The teacher, for example, sees only 
one side of the youngster—the school side that is turned toward 
him. Furthermore, he sees it colored by his own prejudices and
limitations. What he sees as "cooperation" may from another
viewpoint appear to be docility; what he considers "insubordination"
may appear to another to be independence. 

The self-picture is limited by the reporter's lack of self-understand- 
ing and unwillingness to reveal himself to the watching world. We 
do not know ourselves perfectly. Some of our limitations, our petty 
meannesses and evasions, our weak and sensitive spots, we cannot 
face and admit even to ourselves. Still other shortcomings we are 
aware of but are unwilling to acknowledge to an outsider. 

Sometimes one set of limitations will seem more serious, sometimes 
the other. If a person is applying for a job he very much wants, 
we will probably feel that we can put more trust in the evaluations 
of outsiders than in his self-evaluation. He has too much at stake 
in the impression he makes. On the other hand, if he has come to 
us for help and guidance, his own more intimate self-picture may
provide a better basis for counseling with him than will the
impressions of an outsider. We shall need to become acquainted with
evaluation instruments of both types.


PLANNED VERSUS RETROSPECTIVE OBSERVATION 


When we rely upon observations, either by the subject himself
or by others observing him from outside, we may call for new
observations made specially for us, or we may fall back upon the informal
and undirected observations that have occurred in the past.
Suppose we are studying the individual's tendency to become angry.
We might ask him to keep track for us of all the times he got mad
during the following week, noting down the circumstances for each
anger episode, i.e., when it occurred, what precipitated it, what he
did, etc. This would be an example of planned self-observation.
By contrast, a second possibility would be to present him with a set
of situations that tend to annoy or irritate and ask him to look into
himself and judge how readily he gets angry at people who push in
front, at being called by the wrong name, at being called down, and
so forth. The self-observations would now be retrospective. If an
outsider, say, a teacher, were doing the job, he might be asked to
note down times during a specified period when he saw the
particular pupil push, hit, or talk sharply to another. Or he might be asked
to think back over his contacts with the child and rate him on a
scale ranging from "exceptionally calm and even-tempered" to
"flares up and gets angry at the slightest provocation."

Again, there are advantages and disadvantages to both the planned 
and the retrospective type of observation. A major difficulty with 
systematic planned observations is that they are laborious and time- 
consuming. It takes a great deal of time and a high level of observer 
cooperation to get the necessary observations made. Partly because 
of this, the observations are likely to cover a quite limited time 
period and therefore to represent a rather meager sample of the in- 
dividual's behavior. However, when observations are of actual 
current behavior, they should be more objective and less influenced 
by biases and the selective effects of memory than retrospective re- 
ports. The retrospective observations called for in self-report in- 
ventories and in rating scales have been widely used because of their 
administrative simplicity and because they summarize concisely the 
whole history of self-observation or contact with the person rated. 
But this type of summarizing judgment gives the biases of the 
respondent the fullest chance to express themselves. To reduce
the effect of over-all biases and crude general impressions becomes 
a major undertaking. 


OBSERVATION AND TEST COMBINED—THE 
SITUATIONAL TEST 


As we noted earlier, some behavior in test situations leaves no
record behind but must be observed as it occurs. Here we have
something of a hybrid involving both observation and test. The
observer notes the specific errors a child makes when he reads aloud
or his hesitations and false starts in spelling a word. Sometimes
the "test" may involve a much more complex and total situation
and more subtle types of behavior. In many of these "tests," the
person being observed may not realize what is being observed (or
even that he is being observed). So, if we want to appraise the
individual's tendency to get angry, we may put him in a standard
anger-producing situation. For example, we may give him a job to
do and two intentionally stupid assistants who keep making
mistakes and getting in the way. In so far as we are able to present
each subject with the same task, we have a test situation. But we
must depend upon the observations and judgments of outsiders to
evaluate his behavior.


These complexly structured lifelike situations, which strive for 
the uniformity of a test situation and yet for the naturalness of real- 
life events, may be called situational tests. They represent a com- 
promise between the objectivity and standardization of the testing 
approach and the naturalness of a real-life situation. This approach 
presents interesting possibilities for getting at types of behavior 
that do not readily lend themselves to the conventional types of 
testing. 

The practical problems faced in devising situational tests are very
great. They call for elaborate staging if the naturalness of real life
is to be preserved. In addition, the problems of obtaining
satisfactory observations and adequate reports of them remain. For
these reasons, situational tests have not been widely used. But they
present an interesting type of tool, whose possibilities are only
beginning to be explored.


FUNCTIONS FOR WHICH MEASUREMENT HAS
BEEN UNDERTAKEN


Broadly speaking, psychologists and educators have been
interested in measuring in two general areas, what a person can do and
what he will do. Measures of the first sort are measures of ability.
In our discussion we will divide ability measures into measures of 
aptitude and measures of achievement. Again, roughly speaking, an 
aptitude test undertakes to measure what a person could learn to 
do, whereas an achievement test measures what he has learned to do. 

The distinction between aptitude tests and achievement tests is
far from a clear one, because we often use what a person has learned
as a cue to what he can learn. Thus, a measure of the amount
of knowledge of mechanical devices a person has gained in the past
may be one of the most accurate indicators of the amount of
knowledge of things mechanical he will gain in the future. The
clearest distinction between aptitude and achievement tests is in
the direction of our interest. In an aptitude test, our interest is to
predict what the individual can learn or develop into in the future;
in the achievement test our interest is in what he has learned in
the past.


Measures of the second major behavior category, of what the
person will do, correspond to the area we may label personality
measurement. This is a somewhat broad and loose definition of
personality. It is also a rather external one. That is, we have
indicated a concern for what a person does rather than for how he feels
or what his inner urges and conflicts are. We may be interested in
those to a degree. But, so far as a testing or observational pro- 
cedure is concerned, it is always based on what a person does—how 
he acts, what answers he marks, or what he says. His actions are 
the basic material that we study. 

In the long run, his future actions are what we want to predict: 
whether he will graduate from college, whether he will continue in 
and apply himself to a clerical type of job, whether he will behave 
in a more socially acceptable fashion after a particular type of
therapy. We may perhaps make these predictions more surely if we
organize the test and observational appraisals around certain concepts
of interests, needs, or conflicts. But these terms describing the inner 
life of the individual represent inferences that we make as a way of 
structuring and organizing the observations of the individual's be- 
havior. We cannot see a need for approval. What we observe is 
that a child brings things into class, attempts to talk at all times, 
buys candy for other children, and tries to join any social group in 
the playground. We may infer a need for approval as an underlying 
factor related to the various behaviors. 

When we try to measure what a person will do, as distinct from 
what he can do, we encounter some special problems. These are 
primarily problems of intentional distortion of the test results. In
an ability test we want each individual to try hard and do the best
he can. But in personality measures, we do not want to know how
cooperatively a person can behave or how energetic he can be. We
want to know to what degree he typically does show energy or behave
in a cooperative manner. In a limited test situation, where the
nature of the test is clear to the examinee, everyone can put his 
best foot forward. He can probably muster up all the virtues for 
a special occasion. But will he in other situations? It is this ques- 
tion, the question of the degree to which behavior in an identifiable 
test situation will represent behavior in real life, that pushes us into 
disguised tests and into observational evaluations of personality 
characteristics. 


ASPECTS OF PERSONALITY 


It will be convenient to use a number of terms to refer to certain 
fractions or aspects of personality that we may wish to evaluate. 
These terms and the meanings that attach to them are discussed
briefly below.

Character. Character traits are aspects of individual behavior to
which a definite social value has been attached. Honesty,
cooperativeness, thrift, kindliness, and loyalty are all labels for social
virtues. Educational and religious organizations have always been
concerned with the inculcation of such virtues. Based on this con- 
cern there have been developed a number of evaluation procedures 
that we shall refer to as measures of character. 

Adjustment. Educators and psychologists have long been con- 
cerned with the concept of adjustment. The mental hygiene ap- 
proach as applied both in and out of school has striven to develop 
“well-adjusted personalities." Maladjustment is recognized in in- 
dividuals who fail to fit into the social group or who appear to live 
unhappy and unproductive lives. As with character, degree of ad- 
justment represents a social judgment, and what is conceived to be 
well-adjusted behavior varies from one culture to another, depending 
upon what is normal for that culture. Normal behavior in our 
competitive, acquisitive society might seem pathological if trans- 
ferred to a South Sea island. Adjustment will mean, then, behavior 
patterns that enable the person to get along in and be comfortable
in his social setting, typically the setting of middle-class,
twentieth-century American-European culture. We shall encounter a group
of instruments designed to evaluate deviations from this norm— 
the tendency to show maladjusted behavior or behavior typical of 
people who do not get on happily and successfully in our culture. 

Temperament. From early days observers of human nature have
noted conspicuous differences in energy level, cheerfulness or the
reverse, aggressiveness, irascibility, and similar qualities. Literary
men and men of science alike have proposed systems for classifying
temperaments. Hippocrates, for example, proposed that men could
be divided into the sanguine (energetic and cheerful), choleric
(energetic and irascible), phlegmatic (sluggish and placid), and
melancholic (sluggish and sad), and proposed physiological bases for
these distinctions. There have been many other classifications
before and since. Appraisals of such dimensions as these we shall
speak of as measures of temperament.

Interest. The individual makes a variety of choices with respect
to the activities in which he engages. He shows preferences for some,
aversion to others. Appraising these tendencies to seek or avoid
particular activities constitutes the domain of interest measurement.

Attitude. The individual responds with enthusiasm and aversion
not only toward activities but also to social groups, social
institutions, and the other aspects of his world. These reactions, with their
various ramifications, constitute the individual's constellation of
attitudes. A variety of devices have been developed for evaluating
these prejudices pro and con, and these constitute the field of
attitude measurement.


CONCLUDING STATEMENT 


In summary, then, approaches to the measurement of the individ- 
ual cover a great diversity both of methods and of content areas. 
Variations of method may be represented by the following outline: 


I. Test methods, involving a defined task and testing period.
   A. Permanent record or product available for scoring or analysis.
   B. Process must be observed and evaluated as it occurs.
II. Observational methods, in which behavior is observed in the natural
    situations of life.
   A. Self-observation, in which the individual reports on his own
      reactions, as far as he is aware of them.
      1. Planned observations, planned in advance to cover a specified
         period.
      2. Retrospective observation, based on present memory and
         evaluation of past reactions.
   B. Observation by an outsider, in which relative, employer, teacher, etc.
      report on the individual's reactions.
      1. Planned observations.
      2. Retrospective observations.
III. Mixed methods, characterized by some of the aspects of a test but also
     relying upon observation and evaluation of observed behavior.


Advantages and problems of these approaches have been sketched
in but will need to be considered in more detail as specific methods
are elaborated in later chapters.


Aspects of the individual for which evaluation procedures have 
been developed and in which we shall be interested include the fol- 
lowing: 

I. Abilities, evidences of what the individual can do if he tries.
   A. Aptitudes, performances serving as indicators of what he can learn
      to do.
   B. Achievements, performances used to show what he has learned to do.
II. Personality variables, indications of what an individual will do, of how
    one will respond to the events and pressures of life.
   A. Character, certain qualities defined by society as estimable or the
      reverse.
   B. Adjustment, degree of ability to fit into and live happily in the
      culture in which one is placed.
   C. Temperament, qualities relating to energy level, mood, and style of
      life.
   D. Interests, activities that are accepted or rejected.
   E. Attitudes, reactions for or against the people, the phenomena, and
      the concepts that make up society.




This analysis of aspects of the individual is neither complete nor 
detailed. However, it serves to indicate the range of measures with 
which we shall be concerned in the following chapters. 


SUGGESTED ADDITIONAL READING 
Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chap- 
ter 2. 
QUESTIONS FOR DISCUSSION 


1. It would generally be agreed that personality measures are less
satisfactory than measures of aptitude or achievement. What factors give
rise to this?

2. How would you fit each of the following into the classification of
measurement methods given in the chapter?
   a. Anecdotal records kept by a teacher, describing behavior in his
      classroom.
   b. An autobiography written by a pupil for a high-school counselor.
   c. An individual intelligence test in which both questions and answers
      are given orally.
   d. A Boy Scout's record of “good deeds,” kept over a 2-week period and
      reported to his Scoutmaster.

3. Illustrate, from your reading or experience, each of the categories of
measurement methods in the outline on p. 24.

4. How would you fit each of the following into the outline of aspects of
the individual to be evaluated, given on p. 24?
   a. Observations of how well a high-school student gets along with adults.
   b. A pupil's choices from a list of annotated book titles.
   c. A kindergarten child's performance on a test of readiness to learn
      reading.
   d. A pupil's performance on an English test, used to place him in the
      appropriate section.
   e. Ratings of a pupil on his loyalty to his friends.

5. From your reading or personal experience, give an illustration of
measurement procedures for each of the aspects of the individual
identified in the outline on p. 24.

6. A class has just finished a unit on etiquette, and the teacher wishes to
evaluate the effectiveness of the unit. Which of the methods outlined on
p. 24 might she use? What would be the advantages and limitations of
each?


Chapter 3 


The Teacher's Own Tests


In this book dealing with educational and psychological measure- 
ment procedures, we have elected to start with a consideration of the 
teacher's own tests. We have done this for several reasons. In the 
first place, informal test making is an operation that is familiar to 
every teacher, and the outcomes of that test making are familiar to 
every student. In the second place, the techniques available to every 
teacher form the backbone of standardized tests of achievement and 
of aptitude. Finally, a first look at the teacher's own tests can be 
taken without the statistical concepts that we shall need to develop 
as a basis for appraisal of commercially distributed tests. 


THE IMPORTANCE OF TEACHER-MADE TESTS 


Evaluation of pupil progress is a major aspect of the teacher's 
job. A good picture of where the pupil is and of how he is progress- 
ing is fundamental to effective teaching by the teacher and to effec- 
tive learning by the pupil. The evaluation * procedures the teacher 
uses with his group serve a number of functions. We will identify 
four, commenting briefly upon each of them. All the procedures the
teacher develops for pupil evaluation may serve these functions, but 
we shall be concerned to point out how they may be served by the 
more formal evaluation instruments called tests. 


MOTIVATION 


To some degree, varying from pupil to pupil and from class to
class, tests determine when students study, what they study, and
how they study. Tests that are well constructed and effectively
used can motivate students to develop good study habits, to correct
errors, and to direct their activities toward the achievement of
desired goals. Tests that are poorly constructed or used punitively can
just as effectively discourage the students or misdirect their learning.
Testing procedures control the learning process to a greater degree,
perhaps, than any other teaching device.

* The term “evaluation” as we use it here is closely related to measurement.
It is in some respects more inclusive, including informal and intuitive judgments
of pupil progress. It also includes more definitely the aspect of valuing, of saying
what is desirable and good. Good measurement techniques provide the solid
foundation for sound evaluation, whether of a single pupil or of a total curriculum.


DIAGNOSIS AND INSTRUCTION 
Testing serves to diagnose weaknesses and to provide practice for
available knowledges and skills. The items on which an individual
fails or on which many members of a class group fail can serve to
identify points needing further study whenever the test task is
sufficiently precise for the nature of the failure to be identified. The
function of a test as a rehearsal of knowledge and a guide for further
study has long been recognized.


DEFINING TEACHING OBJECTIVES 
What a teacher emphasizes in his evaluation of pupils, and
particularly in the more formal evaluation represented by tests, defines
to his students what that teacher considers important. This
definition is presented in a much more forceful way than any pretty
speeches that the teacher may make. The teacher may avow, to his
students or to his colleagues, that he considers the ability to apply
facts to real situations and to understand basic principles to be much
more important than just learning facts. But if his tests ask only
for names, dates, places, and sentences from the book or his lectures,
those will be his functional objectives, and those will be the things
that his students will study (the docile ones who are influenced by
him anyhow). We may know a teacher by the tests he makes. They
tell what he is truly valuing in his pupils, even though he himself
does not know it, and they influence profoundly what his students
will learn.


DIFFERENTIATION AND CERTIFICATION OF PUPILS 
The teacher inevitably has a responsibility for certifying pupils’ 
accomplishments to higher levels of the educational enterprise and 
to the world outside the school. The testing procedures he uses help 
him to arrive at the judgment that is recorded in his mark, letter of 
recommendation, or other evidence of approbation or disapprobation. 

In view of the many functions they serve and in view of the dis-
service that may be done the pupil from poorly conceived or exe-
cuted evaluation instruments, it is important that the teacher's
evaluation devices be well thought out and well made. To evaluate 
the range of outcomes in which a modern school is interested—under- 
standing as well as knowledge, appreciation as well as skill, ability 
to apply as well as to reproduce, attitudes and interests as well as 
achievements—the teacher must call upon a variety of types of ap- 
praisal. He must profit from observation of classroom performance, 
by recitation, by participation in informal discussions, by contribu- 
tion to group enterprises. He must size up the student in confer- 
ence, interview, and informal discussion. He may have occasion to 
rate the products produced in laboratory or shop and to appraise
the quality of assignments carried out outside of school. He will 
also almost certainly make some use of class tests. Some of the ob- 
jectives of his teaching can be measured efficiently, realistically, and 
completely by pencil-and-paper tests. Some can be measured only 
partially by such means. Some cannot be measured at all in this 
way. This chapter and the next are concerned primarily with those 
objectives that can be measured with tests and with the improve- 
ment of testing procedures to measure them. Some consideration 
will be given to observational procedures, ratings, and other types 
of appraisal devices in later chapters. 


PLANNING THE TEST 


The primary function of any evaluation procedure is to determine
to what extent students have achieved the objectives of instruction.
If a test is to serve this function effectively, it must be planned with
that end in view. A test which “just growed” is unlikely to
correspond very well to the teacher's stated objectives. This is
particularly true in the case of objective tests, and it is here that careful
planning is especially important. However, one should not
overlook the importance of a good test plan even in the case of an essay
test.


If the teacher just sits down and writes objective test items, the
test is likely to be out of balance. It is easier to write simple factual
items than it is to write items that call for understanding of
generalizations or application of principles. It is easier to write items on
some topics than on others. As a result, the teacher is likely to end
up with an overload of items calling for simple information about
the more testable topics. The same thing is true, to a degree, of
essay tests. The outcomes measured by the test will then show a
poor correspondence with those espoused by the teacher. What the
pupils emphasize in their learning will soon follow what they find is
emphasized in their tests, and the tests will fail to foster the
learnings in which the instructor is most interested.


DEFINING OBJECTIVES 
The thoughtful planning of a test involves several steps. The first
and most important step is to determine the objectives that are to be
appraised. This may have to be done by the teacher himself,
perhaps assisted by his colleagues, working from his text or course
outline. However, if the teacher's course follows a published course of
study, some formulation of general objectives is usually already
available there. Even in this case, however, additional work on
defining objectives is often needed. Course of study objectives are
likely to be stated in broad terms and to need to be broken down
into more specific components that provide a more exact definition
of just what is meant by the broad objective. Furthermore, if the
objectives are to be used in test making, they must be translated
into terms of pupil behavior. We must be able to indicate the skills,
knowledge, understandings, and attitudes that a pupil should show
if he has really achieved these objectives of instruction.

Let us look at an actual example. In the section below are listed
the objectives that were stated as the desired outcomes for a unit on
labor relations included as part of a course of study entitled Our
Industrial World intended for an eleventh-grade social studies class.


OBJECTIVES OF A UNIT ON LABOR RELATIONS 


1. To understand why and how labor organizations developed.
2. To know the early labor organizations and their effect upon later ones.
3. To understand the structure and functions of labor unions.
4. To understand the terms used in labor relations.
5. To understand the important legislation regarding labor and labor
   organizations.
6. To understand the advantages and disadvantages of labor unions.
7. To develop an appreciation of the place and function of labor unions in
   our democratic society.
8. To be able to locate and interpret data.
9. To develop the habit of critical evaluation of information concerning
   labor and management.
10. To recognize the dignity of labor.

As stated, these objectives refer in part to the content of the unit
and in part to the processes or activities that the pupil is expected to
display. In thinking about the plan for a test, it is almost essential
that these two aspects be separated. What are the processes or
activities that are called for in this set of objectives? We can perhaps
identify the following as distinct activities in which student gains
are hoped for as a result of their experience with the unit. Note
that none of these makes reference to the specific content involved.


1. Understand the meaning of important terms and concepts. 

2. Possess a store of basic factual information. 

3. Understand developmental sequences and causal relationships. 
4. Understand broad principles and generalizations. 

5. Apply principles and generalizations to new situations. 

6. Be able to locate and interpret relevant information. 

7. Exhibit habits of critical evaluation of evidence.

8. Show socially desirable attitudes and appreciations. 


A number of these are still expressed in terms a little too general to
serve as a guide to item writing. For example, in 6, "Be able to 
locate and interpret relevant information," we are not told what 
kinds of information the student is supposed to be able to locate 
and interpret or how extensive an interpretation is expected of him. 
If we stated this objective in terms of more specific things the stu- 
dent is expected to do, we might break it down as follows: 


1. Use the library card file to obtain sources of information.
2. Use an index to locate information in books and encyclopedias.
3. Use the Readers’ Guide to Periodical Literature to locate recent magazine
   articles or publications.
4. Read and get information from bar, line, picture, and circular graphs
   and from simple tables of numerical data.
5. Find relationships and make inferences from graphic or tabular material.
6. Recognize the limitations of a particular set of data.

Of course, these behaviors are not the only ones the student might
be expected to show if he had achieved this objective of instruction,
but they do show the type of further analysis that is necessary to
put objectives into the form in which they can be a direct guide to
test construction.


OUTLINING CONTENT 


The second step in planning for a test is to outline the content to
be covered. The outline of content is important because the content
is the actual vehicle through which the process objectives are to be
achieved. An outline of content for our illustrative example follows.


OUTLINE FOR CONTENT OF A UNIT ON LABOR RELATIONS


I. The Growth of Organized Labor. Effect of the Industrial Revolution
   on Conditions and Methods of Work.
   A. Beginnings of factory system.
   B. Status of labor during Industrial Revolution.
   C. Growth of cities and commerce.
   D. Age of laissez-faire.
   E. Growth and development of the guilds.
II. Rise of National Unions and Early Efforts toward Organization.
   A. Early unions in the United States.
      1. Trade unions immediately after the American Revolution.
      2. The National Labor Union.
      3. The Knights of Labor.
   B. Opposition of employers to labor unions.
      1. Economic position of employer in contrast to that of employee.
      2. Use of injunctions.
      3. Use of “yellow-dog” contracts.
      4. Use of black lists.
   C. Decline of early labor organizations.
III. The American Federation of Labor and the Congress of Industrial
     Organizations.
   A. Organization of the American Federation of Labor.
   B. Policies of the AF of L.
   C. Growth of opposition to the AF of L among labor.
   D. Decline of the effectiveness of labor unions after World War I.
   E. Effect of the depression on labor.
   F. Formation of the Congress of Industrial Organizations.
   G. Struggle for power between the AF of L and the CIO.
IV. Labor and the Employer.
   A. Collective bargaining.
   B. Strikes.
   C. Open and closed shops.
   D. Boycott.
   E. Union label.
   F. Lockout.
   G. Strike breaking.
V. Labor Legislation.
   A. The Norris-LaGuardia Act.
   B. The National Labor Relations Act.
   C. The Walsh-Healy Act.
   D. Fair Labor Standards Act.
   E. The Taft-Hartley Act.
VI. Structure and Functions of Unions.
   A. Kinds of unions.
   B. Composition of labor movement.
   C. Problems in labor organization.
   D. Union benefits and services.
   E. Responsibilities of labor.


PREPARING THE TEST BLUEPRINT 


Content outline and statement of types of objective represent two
dimensions into which a test plan should be fitted. These two
dimensions need to be put together to give a complete framework and to
see which objectives relate especially to which segments of content.
Part of a chart showing the two in relation to one another is shown in
Fig. 3.1. It is rather crowded, because of space limitations, and in
actual practice it would be desirable to have more space for notes in
each cell of the grid. Each cell can be filled with notes suggesting
the specific facts, understandings, applications, or appreciations that
are important outcomes for that content and that objective. For
example, the upper left-hand cell might list such terms as guild,
laissez-faire, Industrial Revolution, and standard of living. The cell
below that might list what the test maker considered the main facts
about the nature of guilds, the conditions of work before the
Industrial Revolution, standards of living and hours of work in the early
nineteenth century, and so forth. The other cells would be filled in
in the same way. The preparation of such a two-dimensional outline
is undoubtedly an exacting and time-consuming task. The busy
classroom teacher may often fall short of such a complete analysis.
There is no question, however, that it will go far toward clarifying
the objectives of a particular unit and toward guiding not only the
preparation of a sound test but also the teaching of the unit itself.


Process objective             Content
                              Growth of Organized Labor           Rise of National Unions

1. Terms and concepts         Guild; laissez-faire; standard
                              of living.

2. Basic facts                Similarities and differences
                              between guilds and modern
                              labor unions.

3. Understand developments    Reasons for changed relationships
   and causes                 between employer and employee as
                              result of Industrial Revolution.

4. Understand principles      Effect of economic conditions
   and generalizations        on the bargaining power of labor.

5. Apply principles and       Reasons for the low percentage
   generalizations            of the working force who
                              belong to unions.

6. Locate relevant            Use encyclopedia to obtain
   information                information about Industrial
                              Revolution.

7. Critical evaluation        Is able to identify pro-labor
   of evidence                or anti-labor slants in selected
                              passages on labor.

8. Attitudes and              Appreciation for advantages
   appreciation               that have come about as result
                              of Industrial Revolution.

Fig. 3.1. Section of chart for test blueprint.


Some of the objectives the teacher is trying to achieve may be of
a sort that cannot be appraised at all by a paper-and-pencil test or
may not be ones that the teacher wishes to evaluate in an
instrument dealing with a limited segment of content. Our seventh
objective, "Exhibit habits of critical evaluation of evidence," and perhaps
our eighth, "Show socially desirable attitudes and appreciations,"
may well not be testable with a written test. It would be possible
to construct items to see if the student is able to evaluate evidence
critically, i.e., if he can identify bias in information, detect wrong
conclusions or conclusions drawn from insufficient data, and similar
skills, but it would be almost impossible to determine by a test if he
habitually does react critically. Similarly, we could identify
knowledge of and lip service to certain attitudes but probably could not
determine whether the attitudes were really accepted by the student.
The sixth, "Be able to locate and interpret relevant information,"
may not be something the teacher wishes to test at this time or in
relation to this unit. We cannot do everything with one evaluation
medium and at one time.

Once the basic outline has been prepared, the test maker must
decide upon the relative emphasis to be given to the several content
areas and objectives. Thus, 20 per cent of the time or items might
be allocated to terms and concepts, 30 per cent to basic facts, etc.
Again, 10 per cent might be allocated to Growth of Organized Labor,
15 per cent to Rise of National Unions, and so on through the
different topics. Allocation of proportions to the two sets of marginal
categories will suggest how much weight should be given to the
individual cells of the outline. Weighting of topics and objectives is
best done in this way, by manipulating the time or number of items
on a given topic.

The test maker must also decide now whether he will use essay
test questions or short-answer, objective items, and if he decides on
objective items he must decide which type or types he will use. The
choice is governed, at least in part, by the objectives to be measured.
The next section of this chapter will provide some discussion of the 
advantages and disadvantages of essay questions, in relation to the 
objective type of item. Different types of objective items will be 
described and compared in Chapter 4. 

At about this point the total number of essay questions or objec- 
tive test items must be decided upon. This is primarily a function 
of the time available for the test and the type of items being used. 
Different types of objective items differ in the time allotments they 
require, and, of course, an essay question demands a great deal more 
time than an objective item. It is almost impossible to state in 
general terms how much time should be allowed per item for objec- 
tive items of a specific type. The appropriate time allowance is 
affected by a host of different factors. Among the most important 
are (1) the age of the pupils being tested, (2) the length and com- 
plexity of the item, (3) the type of objective being tested —knowl- 
edge of fact or concept versus application to new situation, (4) the 
amount of computation, if any, required by the item, and (5) the 
relative interest of the examiner in speed versus power—the amount 
the pupil can do with unlimited time. In general, it seems unde- 
sirable to emphasize speed in any test designed to measure depth of 
understanding and ability to apply one's knowledge. We suggest 
that the amount of material included be kept down to the amount 
that can be covered comfortably by most members of the group 
being tested. 

The total number of items that will fit into the available time and 
the weights assigned to different topics and objectives will determine 
the number desired for any cell in our test plan. Some flexibility 
should be kept in the allocation, of course, because of difficulty of 
writing acceptable items on certain topics.
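
The arithmetic of this allocation can be sketched in a few lines of
Python. This is only an illustration under assumptions of our own: the
share of each cell is taken to be the product of its objective weight
and its content weight, the total of 60 items is hypothetical, and the
two "all other" categories are hypothetical remainders added so that
each set of weights sums to 100 per cent. Only the 20, 30, 10, and 15
per cent figures come from the discussion above.

```python
def allocate_items(objective_weights, content_weights, total_items):
    """Suggest an item count for every cell of a two-way test blueprint.

    Assumption (for illustration only): a cell's share of the test is the
    product of its objective weight and its content weight.
    """
    cells = {}
    for objective, w_obj in objective_weights.items():
        for content, w_con in content_weights.items():
            cells[(objective, content)] = total_items * w_obj * w_con
    return cells


# Weights named in the text, plus hypothetical "all other" remainders.
objective_weights = {
    "Terms and concepts": 0.20,
    "Basic facts": 0.30,
    "All other objectives": 0.50,  # hypothetical remainder
}
content_weights = {
    "Growth of Organized Labor": 0.10,
    "Rise of National Unions": 0.15,
    "All other topics": 0.75,  # hypothetical remainder
}

# total_items=60 is a hypothetical test length.
plan = allocate_items(objective_weights, content_weights, total_items=60)
for (objective, content), n_items in plan.items():
    print(f"{objective:22s} | {content:26s} | {n_items:4.1f}")
```

The counts come out fractional; in practice they would be rounded and
then adjusted by judgment, in keeping with the flexibility recommended
above.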

A final decision that comes in as part of the preliminary planning
concerns the desired difficulty of the test items. The decision
depends in part upon the purpose of the test. When the test is to
measure mastery of the basic essentials in an area, the questions should
be limited to basic essentials. If the unit has been well taught, all
the items may then turn out to be very easy for the group. When
the purpose of the test is to discriminate levels of achievement of
different members of a group, i.e., to serve as a basis for ranking or
grading, some items should be very easy, most of them should be of
moderate difficulty, and a few should be difficult enough to spread
out the ablest members of the group. Difficulty, in this context,
means the percentage of examinees who get the item wrong.
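
Difficulty in this sense can be computed directly from a record of
right and wrong answers. The following Python sketch is offered only as
an illustration; the function name and the small table of responses are
hypothetical and are not taken from any actual test.

```python
def item_difficulty(responses):
    """Return, for each item, the percentage of examinees who got it wrong.

    `responses` holds one list per examinee, with True for a correct
    answer and False for a wrong one, in the same item order throughout.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    difficulties = []
    for item in range(n_items):
        n_wrong = sum(1 for answers in responses if not answers[item])
        difficulties.append(100.0 * n_wrong / n_examinees)
    return difficulties


# Hypothetical results for four examinees on a three-item test.
responses = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [True, False, True],
]
difficulty = item_difficulty(responses)
print(difficulty)  # [0.0, 50.0, 75.0]
```

Here the first item is very easy, the second is of moderate difficulty,
and the third is hard enough to separate the ablest examinee from the
rest.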

Our test plan now consists of: 


1. An outline of content and objectives.
2. Specific suggestions of what might be covered under each
   combination of content and objective.
3. An allocation of per cents of the total test by content area and
   by objective and an estimate of the total number of items.
4. Specifications for the spread of item difficulties.

The next task is to prepare the actual test items. In the
remainder of this chapter we will discuss the choice of item types and
guides for improving the writing and use of essay questions. In
Chapter 4 we will discuss guides for improving and using
objective-type items.


ESSAY VERSUS OBJECTIVE TESTS 


Teacher-made tests may be divided into two broad categories,
essay or free-answer tests and objective or “new-type” tests. Each
of these has its advantages and limitations, and each has its place
in the measurement of classroom achievement.


THE ESSAY TEST 


The essay test consists of such problems as: 


Compare the craft guilds of medieval Europe with the two major labor
unions that exist in the United States today.

What were the reasons for the decrease in union membership after
World War I?

How did the Wagner Act affect labor's rights to bargain collectively?

The essential characteristics of the task set by an essay test are that
each student

1. Organizes his own answers, with a minimum of constraint.
2. Uses his own words (usually his own handwriting).
3. Answers a small number of questions.
4. Produces answers having all degrees of completeness and accuracy.


In these characteristics lie both the strengths and weaknesses of the
essay examination. Let us consider each in turn.

The Student Organizes His Own Answers. Herein lies the distinctive
advantage of the essay examination. It requires the student to
produce, rather than merely to recognize, the answer. Thus, it
minimizes the possibility of getting the answer by blind guessing or
by using little cues to outguess the test maker. It can, if the
questions are well prepared, bring out the examinee's ability to select
important facts or ideas, relate them to one another, and organize them
into a coherent whole. Emphasizing this integrative type of product,
it elicits, so it is claimed, better study habits in those who are
preparing for it.

The Answer Is in the Student's Words and Handwriting. At this
point a premium is placed upon verbal fluency and skill of expression.
The student who is able to write effectively will often get a
higher grade than another student who clothes the same ideas in less
attractive garb. Too often verbal fluency and aggressive salesmanship,
bluffing, in short, pass for knowledge of the subject. In addition
to skill in writing, quality of handwriting frequently influences
the grade on an essay test. How often has a student been penalized
because the instructor became irritated by poor handwriting, or
could not be bothered to decipher obscure “hen tracks”? Effective
written expression and good penmanship may be legitimate objectives
of the educational enterprise, but they should be evaluated in
their own right. They should not be allowed to contaminate our
appraisal of a student's understanding of the causes of Hitler's rise to
power or of Newton's laws of motion.

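The Test Is Limited to a Small Number of Questions. When the
individual must organize and compose an answer of some length, as
with questions like those on p. 35, the number of questions is
inevitably limited. The time required to answer a single question
makes it impossible to include more than five or ten questions in
even a fairly lengthy test. This tends to result in what we might
call a “lumpy” sampling of what the student knows. We sink four
or five big shafts into the mine of knowledge that the student has.
If these happen to hit pay dirt, the student does well; but if
they hit the gaps in his knowledge, he does poorly. With this small
number of samples, chance is likely to play a relatively large part.
We may get a very unfair sample of a particular student's
knowledge.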

Of course, it is possible to ask free-response questions that call for
quite short answers. We might ask: What did the Taft-Hartley Act
provide with respect to strikes endangering the national welfare?
For this question, we might wish no more than a sentence for an
answer. Or we might ask: What major labor organization is
organized on a craft basis? Questions such as these are transitional
between the essay and the objective test. They can be numerous and
can sample many items of knowledge or understanding. However,
they sacrifice the main feature of the essay question: the
requirement that the examinee put together an organized answer in which
he relates, evaluates, and integrates a number of facts or ideas.

Answers Are of All Degrees of Correctness. The bugaboo of the
essay examination is the laborious and subjective operation of
evaluating the answers. That it is laborious any teacher who has
ever graded a set of essay papers for even a middle-sized class can
testify. That the grading is subjective and relatively undependable
has been shown by a number of separate studies.

Consider the following answer written in 1953 to the question,
"What are the main differences between the American Federation of
Labor (AF of L) and the Congress of Industrial Organizations
(CIO)?"

The AF of L was begun a long way back. It began before 1900 but
the CIO pulled away from it in the 1930's. The head of the AF of L is
William Green and the head of the CIO is Philip Murray. AF of L is
mostly made up of carpenters, plummers and high class workers like
that. CIO is automobile workers and steel workers. John L Lewis
don't get along very well with either of them. In politics the AF of L
is more like the Republicans and anyway it don't [illegible] politics
so much like the CIO does. They got the PAC- Political Action
Counsel or something. Back in the 1930's the CIO used sit-down strikes.
This answer was given to two groups of graduate students in
education, one in a course on methods of teaching social studies and
one in an introductory measurement course. The graduate students in
the course on methods of teaching social studies were given the
following instructions:

Instructions: The question was one of five included in a high-school
course on current social problems. Twenty points is the maximum score
for the question. Please grade this question, using the standards that
you would apply to a high-school group that you had taught. The grade
is to reflect completeness of information and quality of understanding,
not quality of English expression.

The students in the introductory measurement course were given
the same instructions plus a model answer.


38 THE TEACHER’S OWN TESTS 


Suppose that you grade this answer in accordance with the instructions
given above before you read any further. Record the score
that you would give the answer.

Now look at Table 3.1, which shows the scores actually given by
the members of these two class groups. The range in each group is
from close to a perfect score to just a little above zero. Ratings
spread out over that whole range. There is, of course, a piling up
at popular numbers like 5, 10, and 15. Did our pupil write a good
answer to the question? It would be hard to tell by any single score
that he received. It might have been 3, it might have been 18. The
judgments of his answer were certainly varied and undependable.


Table 3.1. Grades Given to Answer on Essay Question 


(Possible score for the question is 20) 


         Students in      Students in
         Measurement      Social Studies
Score    Course           Course

19 1 
18 2 
17 1 1 
16 2 
15 11 4 
14 4 1 
13 2 2 
12 5 7 
11 4 3 
10 13 12 

9 7 

8 Б] 5 

7 4 1 

6 2 

5 3 5 

4 1 

3 1 2 

2 1 
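
One way to put a figure on how widely a set of graders disagree is to summarize their ratings of a single answer by range, mean, and standard deviation. The sketch below is only an illustration. The ratings in it are hypothetical, chosen to resemble the kind of spread described above rather than copied from Table 3.1, and the function name `rating_summary` is our own.

```python
from statistics import mean, stdev


def rating_summary(ratings):
    """Summarize the grades that several graders gave to ONE answer."""
    return {
        "n": len(ratings),
        "low": min(ratings),
        "high": max(ratings),
        "mean": mean(ratings),
        "sd": stdev(ratings),  # sample standard deviation across graders
    }


# Hypothetical ratings (maximum score 20); NOT the data of Table 3.1.
hypothetical_ratings = [3, 5, 8, 10, 10, 12, 15, 15, 18]
summary = rating_summary(hypothetical_ratings)

print(f"{summary['n']} graders, range {summary['low']} to {summary['high']}, "
      f"mean {summary['mean']:.1f}, standard deviation {summary['sd']:.1f}")
```

A standard deviation that is a sizable fraction of the maximum score is a sign that any single grader's mark tells us little about the answer itself.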


Of course, in this illustration the dice were loaded against the 
grader. The single answer was taken out of context. No informa- 
tion was available as to the extensiveness of class treatment of the 
topic, the nature and standards of the particular school, or the 
quality of other answers. However, even when the setting is pre- 
sented, there is still marked variability, both between raters and 
within the same rater from time to time. Thus, Table 3.2 shows the 




grades given to the same set of papers when they were graded twice 
by the same teacher with a 2-month interval. We see from this 
table that of the three students who failed (passing mark 75) on the 
first marking, two were passed on the second marking. The four 
students who tied for second place on the first marking were first, 
third, tied for sixth, and tied for eighth on the second evaluation. 

Table 3.2. Re-marking of Ten Examinations by Same Teacher after 2-Month Interval


Pupil No. First Marking Second Marking 
1 85 70 
2 50 15 
3 90 95 
4 90 85 
5 90 70 
6 99 90 
7 70 60 
8 15 80 
9 60 
10 90 75 


From E. W. Tiegs. Educational diagnosis. Educational Bulletin No. 18,
California Test Bureau, 1952.
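
The two comparisons made in the text for a re-marked set of papers (who crosses the passing mark, and how standings shift) can be carried out mechanically. The following is a minimal sketch under our own assumptions. The marks are hypothetical and are not those of Table 3.2, and the helper names `pass_fail_changes` and `competition_ranks` are ours.

```python
def pass_fail_changes(first, second, passing_mark):
    """Return 1-based pupil numbers whose pass/fail status differs
    between the first and second marking."""
    return [i + 1
            for i, (a, b) in enumerate(zip(first, second))
            if (a >= passing_mark) != (b >= passing_mark)]


def competition_ranks(marks):
    """Rank 1 is the highest mark; tied marks share the best rank,
    so four marks 90, 90, 85, 70 are ranked 1, 1, 3, 4."""
    ordered = sorted(marks, reverse=True)
    return [ordered.index(m) + 1 for m in marks]


# Hypothetical marks for five pupils, graded twice by the same teacher.
first_marking = [85, 60, 90, 90, 70]
second_marking = [70, 80, 95, 85, 60]

changed = pass_fail_changes(first_marking, second_marking, passing_mark=75)
print("Pupils whose pass/fail status changed:", changed)
print("Ranks on first marking: ", competition_ranks(first_marking))
print("Ranks on second marking:", competition_ranks(second_marking))
```

The same two functions could be applied to any pair of markings of one set of papers, whether by the same teacher at two times or by two different teachers.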

The two illustrations are fairly typical of the situation that
prevails in evaluating essay responses. The responses vary in many
ways and by infinitely small degrees. Evaluation of these responses
is highly subjective and generally quite unreliable.


THE OBJECTIVE TEST 

The objective or "new-type" test includes a variety of forms of
test task having in common the characteristic that the correct
answer, usually only one, is determined when the test item is
written. Common forms of objective test items are shown below.


True-False

T F  The American Federation of Labor consists primarily of craft
unions.


Multiple Choice

The chief cause of the decline of the Knights of Labor was that the
organization

A. allowed all types of workers to join.
B. became involved in politics and neglected the workers.
C. failed to use the strike to gain better wages for its members.
D. lost most of its funds owing to poor management by its leaders.


40 THE TEACHER'S OWN TESTS 


Completion 


A court order that forbids a union to strike is called (an injunction) 


Matching 


D 1. Founder of the American Federation of Labor.        A. William H. Sylvis.
E 2. Founder of the Knights of Labor.                    B. John L. Lewis.
B 3. Founder of the Congress of Industrial               C. Walter Reuther.
     Organizations.                                      D. Samuel Gompers.
                                                         E. Uriah S. Stevens.


The essential features of a test made of objective items, as distinct 
from an essay test, are that the examinee 


1. Operates within an almost completely structured task. 

2. Selects one of a limited number of alternatives. 

3. Responds to each of a large sample of items. 

4. Receives a score for each answer according to a predetermined
key.


Again, let us examine these characteristics to see the advantages
and disadvantages of each. In large measure, they are the reverse
of those discussed for essay examinations.

The Task Is Completely Structured. The examinee does not have a 
chance to organize and define the problem for himself. On the debit 
side, this means that a test of this sort is not useful for appraising 
skills of organizing and structuring ideas. On the credit side, we are 
more sure that each examinee is presented with the same problem.
"Discuss the rise of labor unions in the United States after the Civil 
War," can carry quite different meanings to different readers. 

The Examinee Selects from Among Given Alternatives. In most
types of objective item, the possible alternatives are completely 
specified. (This is not the case with the completion type of item, 
and in that respect it is on the boundary line, approaching the short 
free-response type of question.) Where the alternatives are all pro- 
vided, the student is only required to recognize the right answer, 
not to produce it by his own efforts. This has been criticized as 
representing a lower level of intellectual process, and one that is less 
true to life. How valid this criticism is probably depends upon how 
skillfully the objective items are written, and how much they manage 
to get away from the words of the text and simple memory of factual 
materials. When an objective test item presents a new problem that 
must be solved by recalling and applying facts or principles previ- 
ously learned, this type of item can require just as active recall as 
any essay question. 




Another outcome of the limited set of answer choices is that an
examinee can be expected to get some answers right by guessing.
This becomes a problem particularly for true-false questions in 
which there are only two choices. Tossing a coin would give 50 per 
cent right on the average, and people would get different scores to 
some extent because they were lucky or unlucky coin tossers. The 
problem of guessing would be serious in a short test with few answer 
choices, but chance successes will tend to even up in the long run if 
there are enough items, if enough time is given for everyone to 
complete the test, and if instructions about guessing can be made 
sufficiently definite so that all examinees will adopt the same
policy.
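
How quickly chance successes "even up" as items are added can be seen from a short calculation. The sketch below assumes, as the coin-tossing picture above does, that every item is answered by blind guessing among equally likely choices. Under that assumption the per cent correct has a mean of 100/k for k choices and a standard deviation that shrinks with the square root of the number of items. The function name `chance_score_sd_percent` is ours.

```python
import math


def chance_score_sd_percent(n_items: int, n_choices: int = 2) -> float:
    """Standard deviation, in percentage points, of the per cent correct
    obtained by blind guessing on n_items items with n_choices options each
    (binomial model; assumes independent, equally likely guesses)."""
    p = 1.0 / n_choices
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


for n in (10, 25, 100, 400):
    print(f"{n:4d} true-false items: luck alone spreads scores by about "
          f"{chance_score_sd_percent(n):.1f} percentage points")
```

With only a handful of true-false items the lucky and the unlucky guesser may differ widely; with a hundred or more, luck accounts for only a few percentage points.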

The Sample of Items Is Large. Since each item is brief, many
items can be included. These can be spread more evenly over the
topics to be covered and a more representative sampling can be
obtained. This reduces the role of luck, of the individual just
happening to have reviewed a particular topic. As a consequence of
this inclusion of many separate items to evaluate the individual, the
score from a well-made objective test is likely to be more accurate
than an essay test, so that two separate tests of an individual will
rank him in more nearly the same place in his group.

Each Item Has a Predetermined Key. The key is established once
and for all by the test maker at the time the test items are written.
This means that scoring the test is a routine clerical task and can
be done by a person who knows nothing about the subject matter of
the test or even by one of the electrical test-scoring machines on the
market. The saving in time to score the test is very substantial,
but it must be remembered that much of that saving will have been
used up in preparing the test. Writing clear and unambiguous
objective test items is a fairly demanding literary task.

The economy in time is less important than the uniformity that
results. The score will be the same whoever scores the test, once the
key has been agreed upon. The score will be the same no matter
who it was that chose the answers. Teacher's pet or hellion,
Spencerian specialist or scribbler, if they choose the same answer they
get the same score.
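
The routine character of scoring against a predetermined key is easily shown. The sketch below is a bare illustration. The key and the responses are hypothetical, and the function name `score_with_key` is ours. Note that nothing in it refers to the subject matter or to the identity of the examinee.

```python
def score_with_key(responses, key):
    """Count the responses that match the predetermined key.

    Each item earns one point; no judgment by the scorer is involved.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must cover the same items")
    return sum(1 for given, keyed in zip(responses, key) if given == keyed)


# Hypothetical four-item key, fixed when the items were written.
answer_key = ["T", "A", "D", "B"]

pupil_one = ["T", "A", "C", "B"]
pupil_two = ["T", "A", "C", "B"]   # same choices, different pupil

print(score_with_key(pupil_one, answer_key))
print(score_with_key(pupil_two, answer_key))
```

Two pupils who mark the same choices necessarily receive the same score, whoever does the scoring.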


SUMMARY COMPARISON

The issues we have been discussing are summarized in tabular
form below. In each case a plus sign is placed in the column of the
test pattern that would be judged superior with respect to that
factor.


42 THE TEACHER'S OWN TESTS 


Factor                                                        Essay   Objective
Provides opportunity to test student's ability to select,
  organize, and integrate                                       +
Requires student to produce answer and not just recognize it    +
Is free from factors of skill in expression and penmanship                +
Is free from opportunities for bluffing                                   +
Is free from opportunities for guessing                         +
Provides an adequately representative sample of the topics
  covered                                                                 +
Can be prepared quickly                                         +
Can be scored quickly                                                     +
Can be scored routinely by a clerk                                        +
Can be scored with high consistency from scorer to scorer                 +


The balance of importance between these factors will vary from 
situation to situation. It is clear that neither type has exclusive 
claim to all the advantages. In evaluating the work of his class, 
the teacher may well wish to use both kinds of testing procedures. 


EFFECTIVE USE OF THE ESSAY EXAMINATION 


In part because of the ease with which they can be prepared, in
part because of their advantages in evaluating abilities to organize
an answer to a question, recall and select relevant information, and
present it logically and effectively, essay examinations will continue
to be used in evaluation of pupil performance. If they are to be
used, the teacher should have some guiding principles as to when to
use them and should do what he can to overcome their common
weaknesses. These weaknesses are found partly in the format of the
questions and partly in the process of evaluating the answers
produced by the students.


WHEN TO USE ESSAY EXAMINATIONS 


The factors that make it appropriate to use an essay examination 
are in part very immediate practical ones, in part more fundamental 
theoretical considerations. 

Immediate Practical Considerations. An essay examination is a 


practical compromise when the class group is small or when time to 


prepare the examination is limited. It is easy to prepare, and when 
the class is small the burden of reading the papers does not over- 
balance the saving in preparing the test. 

A consideration that may be compelling in some cases is lack of 
reproduction facilities for running off copies of the test. Then a set 


of essay questions written on the blackboard is a practical solution. 




Another practical solution is to read objective questions to the class. 
This procedure works fairly well for the simpler types of items but 
rules out items with more complex structure. 

A third point that is sometimes made is that essay questions are 
less demanding upon the skill of the teacher. It is probably true 
that ambiguities and poor expression are more apparent in an objec- 
tive item, though confusion as to what is wanted in response to an 
essay question can also be substantial. Many of the faults in writing 
objective items can be avoided once they have been pointed out, so 
that it seems more desirable to improve item-writing skills than to 
resort to essay questions as a defense. 

More Basic Theoretical Issues. The functions that can be appraised
better by an essay question than by short-answer questions
are abilities to select, relate, and organize, to create essentially new
patterns, and to use language to express one's ideas. Essay questions
are appropriate when these abilities assume a prominent place
among the outcomes that are to be appraised. There is little
justification in using the essay examination when the questions included
call largely for the reproduction of factual information. Possession
of this information could be tested more efficiently by a series of
objective test items.

If essay questions are to be used to assess abilities to select and
organize, to create new syntheses, and to express oneself, they must
be phrased so as to elicit these functions. The question must call
for an application or creative synthesis of what has been taught,
rather than reproduction of facts as they were learned. Thus, the
question

What provision does the Taft-Hartley Law make with respect to
jurisdictional strikes?

tests only information. The problem

Pickets from the Seaman's Union refused to man the SS Colossal and
picketed the pier from which she sailed because the [illegible] bakers,
and kitchen workers belonged to the Hotel Worker's Union. The steamship
company took the case to court. What decision could be expected from
the court? Why?

requires identification and selection of the proper items of
information, and their application to the solution of a new problem. The
second question seems clearly more appropriate for an essay
examination.

Evidence has been presented [1, 2, 3, 4] that the prospect of an essay
examination leads to study activities emphasizing the organization
and interrelationships of facts and principles in an area whereas the
prospect of an objective examination leads to the memorizing of
discrete details. In so far as this is the case, it represents one strong
argument for retaining essay examinations, in spite of their
limitations as measurement devices. One wonders, though, whether this
is a necessary relationship or whether it is merely a reflection of the
low quality of the objective tests to which the groups had been
exposed.

Variants on the Essay Examination. Values claimed for the essay
examination are those of appraising ability to organize materials and
to use language effectively to express the resulting organization.
However, in the usual scheduled essay examinations these functions
may become submerged because (1) differences in knowledge of the
basic facts hide differences in ability to organize those facts and (2)
time pressures hide the quality of the individuals' written expression.

Two variations may be considered that appear likely to bring out
the factors in which we are particularly interested. One is to give
an "open book" examination, in which every individual has access
to any basic data present in his text, his notes, or other sources.
Memory of facts is then reduced as a factor entering into individual
performance, and ability to locate, select, and use the facts is brought
to the fore.

The second variation is to give the problems as an out-of-class
examination with unlimited time. This minimizes time pressure,
and makes the test more nearly a pure power test, power both with
respect to organizing ability and with respect to written expression.
We do, of course, introduce a new problem, since we are less able to
guarantee the integrity of the written material turned in. When the
examination is used against rather than for the pupil, illicit help is
likely to become a serious problem.


PREPARING ESSAY QUESTIONS 


One of the common weaknesses of essay questions is that they are
too vague, general, and comprehensive. This has a number of
disadvantages. So much time is required to answer a single question
that very few questions can be included. Even with a small number
of questions, the student may feel pressed for time and be
handicapped by his limited speed of writing. In his selection from among
all the things he might write about, it may be in good part a matter
of luck whether he picks out the aspects of the topic that the
particular instructor considers to be important. A question such as
"Describe the significant developments in organized labor since the Civil
War" would appear to be subject to these criticisms.

In view of these criticisms, it would seem appropriate to replace
the one long, general question with one or more of a series of shorter,
more specific ones. Thus, the question indicated above might be
replaced by one or more of the following four:

1. Why did the Knights of Labor fail as a labor organization?

2. Describe briefly three of the methods used by employers to fight
unions during the period 1890 to 1915.

3. Pick out three provisions of the Taft-Hartley Act that are
particularly offensive to organized labor and explain why labor considers
each unfair.

4. In what ways are the AF of L and the CIO alike and in what ways
are they different?

Questions such as these require fewer words and, consequently, less
time to answer. Furthermore, the task is much more clearly and
unequivocally defined for each examinee. We can feel more sure
that each student is responding to the same problem.

How far to go in breaking the subject matter up into precise
questions with specific brief answers is a problem, however. The more
specific the task set by the question, the less selection, organization,
and integration of materials it calls for. In the extreme, questions
may become so specific and limited that they approach the
short-answer form. The question "What did the Taft-Hartley Act
provide with respect to the closed shop?" is of this sort. When the essay
format is being used merely because it is convenient for the
instructor, a test made up of a large number of questions, each of which
can be answered in a sentence or two, may be a fairly satisfactory
substitute for an objective test. However, such a test will have lost
much of the distinctive value of the essay examination.

When the essay examination is being used with some genuine
interest in assessing ability to select, organize, and integrate material,
an intermediate point on the scale from general to specific is
probably the optimum. The question must have enough scope so that
some selection and integration on the part of the examinee is called
for. But enough guidance should be provided so that each student
is oriented toward the same defined task. In part, this guidance may
be provided by restricting the scope of the question; in part, it may
be provided by giving a fuller statement of the problem. This is
shown below for the general question considered earlier.


46 THE TEACHER'S OWN TESTS 


Example 


Describe the significant developments in organized labor since the Civil 
War, considering especially 

a. the nature and role of different labor organizations. 

b. the reaction of management to labor's efforts to organize. 

c. legislation affecting labor. 

d. the role of the courts.

e. the role of public opinion.


This type of analysis of the task guarantees a more common basis
for response. In one sense it breaks the one question up into five.
(The analysis also makes clear that the original question would be
suitable for an all-day comprehensive examination, rather than one
of more limited scope.)


When an essay examination is being used to appraise achievement
of the objectives of a common program of study, each examinee
should be required to answer the same questions. Giving a choice
of questions reduces the common base on which different individuals
may be compared. It adds one further source of variability to the
subjectivity and unreliability that already exist. A choice of
questions may have public relations value with the examinees, but it has
no justification from the point of view of effective measurement.


SCORING ESSAY EXAMINATIONS 


A number of steps may be taken to mitigate the subjectivity and
reduce some of the biases in evaluating the answers to an essay
examination. These are mostly attempts to break up the process of
evaluation into a series of more specific, fractionated judgments
made upon a common base and applied to an anonymous product.
Specific suggestions are outlined below.

1. Decide in advance what factors are to be measured. If more than
one distinct quality is to be appraised, make separate evaluations of
each. If facts are considered important, score for facts. If
organization is important, give a rating upon organization. If mechanics of
English, sentence structure, spelling, punctuation, etc., are
considered a significant outcome, give a rating upon mechanics.
However, do not contaminate the rating for knowledge or understanding
with appraisal of mechanics. It is harder to isolate quality of
organization from extent of factual information, but if the essay
question is to serve its distinctive purpose an attempt should be
made to do so.

2. Prepare a model answer in advance, showing what points are
desired and the credits to be allowed for each. This will provide a
common frame of reference for evaluating the single papers. When the
preliminary model has been prepared, it should be checked against
a sample of student responses to the question. The model and the
scoring scheme should be modified in the light of these answers.
After it is suitably modified, it can be used as the yardstick for
assigning credits to each paper in turn.

3. Read all answers to one question before going on to the next. A
more uniform standard can be maintained for a single question and
for a short period of time. There is more chance to compare one
person's answer with another's and thus to build up a "feel" for the
answers. There is less contamination of judgment by what that
same examinee had written on the previous question.

4. Grade the papers as nearly anonymously as possible. The less
you know about who wrote an answer, the more objectively you can
grade what was written.

5. Greater reliability can be obtained by averaging independent
ratings. If the importance of the test merits the expenditure of effort,
a more dependable appraisal can be obtained by having one or more
additional raters, each of whom gives an independent rating of the
responses.
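
Two of the suggestions above, scoring against a model answer with stated credits and averaging independent ratings, lend themselves to a simple illustration. The sketch below is ours and rests on assumptions. The points listed in `MODEL_ANSWER` and their credits are hypothetical (they total 20, the maximum used in the earlier illustration), and the function names are not drawn from any established scheme.

```python
from statistics import mean

# Hypothetical model answer: each desired point and the credit allowed for it.
MODEL_ANSWER = {
    "craft versus industrial basis of organization": 8,
    "types of workers in each organization": 4,
    "political activity": 4,
    "history of the separation": 4,
}


def score_against_model(credits_awarded):
    """Total the credits one rater awards, point by point.

    `credits_awarded` maps a point in MODEL_ANSWER to the credit given;
    points the rater does not mention earn zero.
    """
    total = 0
    for point, max_credit in MODEL_ANSWER.items():
        awarded = credits_awarded.get(point, 0)
        if not 0 <= awarded <= max_credit:
            raise ValueError(f"credit for {point!r} must be 0..{max_credit}")
        total += awarded
    return total


def pooled_rating(independent_totals):
    """Average the totals reached independently by several raters."""
    return mean(independent_totals)


# Two hypothetical raters reading the same paper independently.
rater_a = score_against_model({
    "craft versus industrial basis of organization": 6,
    "types of workers in each organization": 4,
})
rater_b = score_against_model({
    "craft versus industrial basis of organization": 8,
    "types of workers in each organization": 3,
    "political activity": 3,
})
print(rater_a, rater_b, pooled_rating([rater_a, rater_b]))
```

A separate rating for organization or for mechanics of English, where these are wanted, would be recorded alongside this total rather than mixed into it.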
SUMMARY STATEMENT 


Evaluation of pupil achievement is one of the teacher's important
responsibilities. In view of the many functions that tests serve in
motivating and directing learning, and in view of the disservice that
may be done the pupil from poorly conceived or executed evaluation
instruments, it is important that the teacher's evaluation devices be
well thought out and well made. Both written tests and a variety
of informal appraisals are needed to evaluate completely the
objectives of the modern curriculum.

For any type of written test, it is desirable to have a definite plan
in advance of preparing the test items. The development of such a
plan requires an analysis of the outcomes one is trying to achieve in
the teaching of a particular course or unit and of the significant
segments of content through which those objectives are to be realized.
A statement of objectives useful for this purpose
must be phrased in terms of pupil behaviors, that is, what the
pupil is supposed to be able to do, rather than in broad
generalizations. In addition, the plan should include the allocation of test
items among the content areas and objectives, the types of items to
be used, the total number of items in the test, and specifications for
the spread of item difficulties.
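
The allocation of test items among content areas and objectives can be worked out arithmetically once relative weights have been decided. The sketch below shows one possible way of doing this, not a prescribed procedure. The content areas, objectives, and weights are hypothetical, and the rounding rule (largest remainders receive the leftover items) is our own choice.

```python
def allocate_items(weights, total_items):
    """Distribute total_items over blueprint cells in proportion to weights.

    `weights` maps (content area, objective) to a relative weight.
    Fractions are rounded down, and leftover items go to the cells with
    the largest remainders, so the counts always sum to total_items.
    """
    weight_sum = sum(weights.values())
    exact = {cell: total_items * w / weight_sum for cell, w in weights.items()}
    counts = {cell: int(share) for cell, share in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical blueprint for a unit on organized labor.
blueprint_weights = {
    ("labor organizations", "knows facts"): 3,
    ("labor organizations", "applies principles"): 2,
    ("labor legislation", "knows facts"): 2,
    ("labor legislation", "applies principles"): 3,
}

plan = allocate_items(blueprint_weights, total_items=40)
for (area, objective), n in plan.items():
    print(f"{area:22s} {objective:20s} {n:3d} items")
```

The weights themselves remain a matter of the teacher's judgment about what the course has emphasized; the arithmetic only keeps the finished test faithful to that judgment.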


48 THE TEACHER'S OWN TESTS 


Both essay and objective tests can be used to evaluate pupil 
achievement. The essay test is easier to prepare and has certain 
advantages in appraising ability to recall information, select relevant 
material, and organize it into an integrated answer. However, the 
objective test has marked advantages in freedom from such irrele- 
vant factors as quality of handwriting or of English usage, in breadth 
of sampling of the desired outcomes of teaching, and in ease and
objectivity of scoring.

Essay questions can be improved by phrasing the question so as 
to present a well-defined task to the student and by providing con- 
ditions for scoring that reduce as far as possible the subjectivity of 
grading. 


REFERENCES 


1. Douglass, H. R., and Margaret Tallmadge, How university students
prepare for new types of examinations, Sch. & Soc., 1934, 39, 318-320.

2. Meyer, G., An experimental study of old and new types of examination,
J. educ. Psychol., 1934, 25, 641-661 and 1935, 26, 30-40.

3. Terry, P. W., How students review for objective and essay tests, Elem. 
Sch. J., 1933, 33, 592-603. 

4. Terry, P. W., How students study for three types of objective tests, 
J. educ. Res., 1934, 27, 333-343. 


SUGGESTED ADDITIONAL READING 


Lindquist, E. F., Preliminary considerations in objective test construction,
Chapter 5 in E. F. Lindquist, Editor, Educational measurement, Washington,
D. C., American Council on Education, 1951.

Monroe, Walter S., Encyclopedia of educational research, rev. ed., New York,
Macmillan, 1950, pp. 403-406; 407-412.

Odell, C. W., How to improve classroom testing, Dubuque, Iowa, William C.
Brown, 1953, Chapters 4 and 5.

Smith, Eugene R., et al., Appraising and recording student progress, New York,
Harper, 1942, Chapters 1 and 2.

Stalnaker, John M., The essay type of examination, Chapter 13 in E. F.
Lindquist, Editor, Educational measurement, Washington, D. C., American
Council on Education, 1951.

Tyler, Ralph W., The functions of measurement in improving instruction,
Chapter 2 in E. F. Lindquist, Editor, Educational measurement, Washington,
D. C., American Council on Education, 1951.

Vaughn, K. W., Planning the objective test, Chapter 6 in E. F. Lindquist,
Editor, Educational measurement, Washington, D. C., American Council on
Education, 1951.

Weitzman, Ellis, and Walter J. McNamara, Constructing classroom
examinations: a guide for teachers, Chicago, Science Research Associates, 1950,
Chapters 1 and 2.




QUESTIONS FOR DISCUSSION 


1. Prepare a statement of the objectives for a course, or a unit within a
course, that you are teaching or plan to teach.

2. Using the statement of objectives from 1 and a course outline, prepare
a blueprint for a test to evaluate the unit or course.

3. Indicate the objectives from 1 that could be measured only partially or
not at all by a written test. Why is the measurement incomplete? How else
might these objectives be appraised? Why do you select these procedures?

4. It has been said that one of the goals of the music program in an
elementary school is to "increase the sensitivity of pupils to music in its
different forms." How could this goal be defined so that progress toward it
could be measured?

5. Students are sometimes heard to remark: "You can't get a good mark
on Miss X's tests unless you really know Miss X." What does this remark
imply about Miss X's tests?

6. On p. 42 is a list of factors that have been presented as favoring either
essay or objective tests. Do you agree with the classification given there?
Which are the most important factors? What other points should be
considered in deciding which type of test to use for the final examination in a
particular course?

7. Criticize the following plan for an essay test in social studies for a ninth
grade class:

a. There will be 10 questions on the test.

b. Each student will answer any 3.

c. Each question will have a value of 20 points.

d. One point will be taken off for each misspelled word and each
grammatical error.

e. A 5-point bonus will be given for neatness.

f. Time for the test will be 40 minutes.

8. Criticize and revise each of the following essay questions:

a. Discuss the increase in juvenile delinquency since World War II.

b. Discuss government support of farm prices.

c. Discuss the "cold war."

9. For what types of objectives would an open-book essay examination be
appropriate? What would be the advantages and disadvantages of such an
examination, as compared with the usual essay examination?


Chapter 4 
Preparing Objective Tests 


The objective type of test item was developed in order to over- 
come some of the disadvantages of the essay test discussed in Chap- 
ter 3. Among teachers and students there is still a good deal of 
argument about the relative merits of the two types of test. Those 
who object to the objective type of test say that it emphasizes 
factual material, encourages piecemeal memorization of unimportant 
details, and neglects the more important educational objectives. 
Most of these objections to tests of the objective type have been 
caused by poor planning for the test, poor item writing, or poor choice 
of the type of objective item. A poorly constructed objective test can 
inhibit learning but then so can a poorly constructed essay test. 

In this chapter we will consider methods of improving and using 
the objective type of item and of analyzing and using the results of 
objective tests. 


WRITING THE ITEMS FOR AN OBJECTIVE TEST 


Writing good test items is an art. It is a little like writing a good
sonnet and a little like baking a good cake. The operation is not
quite so free and fanciful as writing the sonnet; it is not quite so
standardized as baking the cake. It lies somewhere in between. So
a discussion of item writing lies somewhere between the exhortation
to the poet to go out and express himself and the precise recipes of a
good cookbook. The point we wish to make is that we do not have
a science of test construction. The guides and maxims that we shall
offer are not tested out by controlled scientific experimentation.
Rather, they represent a distillation of practical experience and
professional judgment. As with the recipe in the cookbook, if carefully
followed they yield a good product.
We shall first go over some suggestions that apply to almost any
type of objective item. Then we will consider specific item types,
indicating some of the general virtues and limitations of the type of
item and giving more specific suggestions for writing and editing.
Many of the maxims that we set forth will seem very obvious.
However, experience in writing and editing items indicates that
the most obvious faults are the ones that are most frequently
committed by persons who try to prepare objective tests. Thus, it
hardly seems necessary to insist that a multiple-choice item must
have one and only one right answer, and yet items with no right
answer or several occur again and again in tests that are carelessly
prepared.


GENERAL MAXIMS FOR ITEM WRITING 

1. Keep the Reading Difficulty of Test Items Low in relation to the
group who are to take the test, unless the purpose is to measure
verbal and reading abilities. Ordinarily you do not want language
difficulties to interfere with a pupil's opportunity to show what he
knows.
Example 

Poor: The legislative enactment most distasteful to the protagonists of
labor has been the

A. Walsh-Healy Act.
B. Norris-LaGuardia Act.
C. Wagner Act.
D. Taft-Hartley Act.

Better: The law to which labor supporters have objected most has been the

A. Walsh-Healy Act.
B. Norris-LaGuardia Act.
C. Wagner Act.
D. Taft-Hartley Act.


2. Do Not Lift a Statement Verbatim from the Textbook. This places
a premium upon rote memory with a minimum of understanding. A
statement can at least be paraphrased. Better still, in many cases it
may be possible to imbed the specific knowledge in an application.

Example

Poor: An injunction is a court order forbidding specified actions, such as
striking or picketing by unions. T F

Better: (Paraphrased) If a court issued an order forbidding Union X to
strike, this order would be called an injunction. T F

* The keyed response will be underlined for multiple-choice and true-false
items and filled in for completion or matching items.




3. If an Item Is Based on Opinion or Authority, Indicate Whose
Opinion or What Authority. Ordinarily statements of a controversial 
nature do not make good items, but there are instances where know- 
ing what some particular person thinks may be important for its 
own sake. The student should presumably be acquainted with the 
viewpoint of his textbook or instructor, but he should not be placed 
in the position of having to endorse it as indisputable fact. 


Example 
Poor: The basic cure for jurisdictional disputes is to remove the fear of
unemployment. T F
Better: According to Faulkner and Starr, the basic cure for jurisdictional
disputes is to remove the fear of unemployment. T F


4. In Planning a Set of Items for a Test, Care Must Be Taken that
One Item Does Not Provide Cues to the Answer of Another Item or 
Items. The second item below gives cues to the first. 


Example 
1. A court order restraining a union from striking is called 

A. a boycott.

B. an injunction. 

C. a lockout. 

D. an open shop. 

2. The Taft-Hartley Act provides that an injunction may be called for 80 
days to prevent strikes 

A. of government or municipal workers. 

B. of public utility workers. 

C. arising out of jurisdictional disputes. 

D. which threaten to endanger the public welfare. 

5. Avoid the Use of Interlocking or Interdependent Items. The 
answer to one item should not be required as a condition for solving 
the next item. This is the other side of the principle stated in 4 
above. Every individual should have a fair chance at each item as 
it comes. Thus, in the example shown below, the person who does 
not know the answer to the first question is in a very weak position 
as far as attacking the second one is concerned. 


Example 


1. The new labor technique introduced in the big automobile and steel
strikes of 1937 was the ______ (sitdown strike)
2. Public reaction to this technique was generally ______ (unfavorable)




6. In a Set of Items, Let the Occurrence of Correct Responses Follow
Essentially a Random Pattern. Avoid favoring certain responses, i.e.,
either true or false, or certain locations in a set of responses. Do
not have the responses follow any systematic pattern.
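One way of meeting this maxim when a test is assembled by machine is to let a random number generator place the keyed option. The sketch below is only an illustration under stated assumptions: the function names `shuffle_options` and `key_position_counts` are hypothetical, each item is assumed to be a list of options plus the index of the keyed option, and the tally at the end is simply a check that no position is heavily favored.

```python
import random
from collections import Counter


def shuffle_options(options, key_index, rng):
    """Return (shuffled_options, new_key_index) for one item.

    The options are reordered at random; the keyed option is tracked
    so that the scoring key can be updated to its new position.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(key_index)


def key_position_counts(items, seed=0):
    """Shuffle every item and tally where the keyed option lands."""
    rng = random.Random(seed)
    counts = Counter()
    for options, key_index in items:
        _, new_key = shuffle_options(options, key_index, rng)
        counts[new_key] += 1
    return counts


if __name__ == "__main__":
    item = (["a boycott.", "an injunction.", "a lockout.", "an open shop."], 1)
    print(key_position_counts([item] * 40, seed=3))
```

Options that have a natural order, such as dates, would ordinarily be left in that order; in that case only the tally is of use, as a check on the key positions the writer has chosen.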

7. Avoid Trick and Catch Questions, except in the rare case in 
which the test has a specific purpose of measuring ability to keep 
out of traps. Trick questions are likely to mislead the abler or 
better-informed student, who knows enough to be caught by the 
trap. If they do this, they defeat the basic purpose of the test, 
which is to identify levels of knowledge and understanding. 


Example 


Under the leadership of John L. Lewis, the Congress of Industrial
Organizations broke away from the American Federation of Labor in 1935. T F
(This would have to be scored false, because at that date the organization
was known as the Committee for Industrial Organization.)


8. Try to Avoid Ambiguity of Statement and Meaning. This is a
general admonition, somewhat like "Sin no more," and it may be no
more effective. However, it is certainly true that ambiguity of
statement and meaning is the most pervasive fault in objective test items.
Many of the specific points already covered and to be covered deal
with specific aspects of the reduction of ambiguity.


Example 


The general trend in union membership since 1880 has paralleled very
closely

A. business cycles.
B. general economic conditions.
C. the labor force.
D. fluctuations in the cost of living.

The keyed answer to the above question was B, but the examinee
trying to answer the item is faced with several problems. First of
all, what is meant by "general trend"? Does this mean the general
increase from 1880 to 1950, or does it also refer to all the ups and
downs in between? Secondly, what did the writer have in mind
when he wrote "union membership"? Does it mean the actual
number of people belonging to a union, or does it mean the percentage
of all of the people in the labor force who belong to unions?
Third, how close does the relationship between union membership
and any one of the options have to be before one can say that they
parallel each other very closely?




Now look at the options. None of the options makes a very satisfactory
completion for the stem. Option A, "business cycles," and
option D, "fluctuations in the cost of living," are included within
option B, "general economic conditions." Option C is not clear.
Does the writer mean the number of people in the labor force as a 
whole? Does he mean the occupational distribution of the people 
in the labor force? Does he consider unemployed workers or part- 
time workers as part of the labor force? 

The item needs to be sharpened up in several respects. The 
example below would appear to test the same knowledge and to 
provide less occasion for misunderstanding of what the examiner 
was trying to say. 


Example 


most rapidly 


A. when economic conditions were good. 

B. during periods of economic depression. 

C. after court decisions that were unfavorable to labor. 
D. when factories moved to rural and undeveloped areas. 


9. Beware of Items Dealing with Trivia. An item on a test should
appraise some important item of knowledge or some significant
understanding. Avoid the type of item that could quite justifiably be
answered, "Who cares?" Ask yourself in each case whether knowing
or not knowing the answer would make a significant difference in the
individual's competence in the area being appraised.


Example 
Poor: The Taft-Hartley Act was passed in
A. 1945.
B. 1946.
C. 1947.
D. 1948.

Better: Which of the following contract provisions between management and
labor would be specifically prohibited under the Taft-Hartley Act?

A. All newly hired employees must join the union within 90 days after
employment.
B. No person can be employed unless he is already a member of the union.
C. Union members will be given preference in the hiring of new employees.
D. New employees will be hired without regard to union membership or
promise to join the union.




TRUE-FALSE ITEMS 
The true-false item has had a popularity in teacher-made objective
tests far beyond that warranted by its essential nature. This is
probably because bad true-false items can be written quickly and
easily. To write good ones is quite a different matter. Even when
they are well written, true-false items are relatively restricted in the
types of educational objective they can measure. They should be
limited to statements that are unequivocally true or demonstrably
false. For this reason, they are adapted to measuring relatively
specific, isolated, and often trivial facts. They can also be used
fairly well to test meanings and definitions of terms. But items testing
genuine understandings, inferences, and applications are usually
very hard to cast in true-false form. The true-false item is particularly
open to attack as fostering piecemeal, fractionated, superficial
learning and is probably responsible for many of the attacks upon
the objective test. It is also in this form of test that the problem of
guessing becomes most acute.

The commonest variety of true-false item presents a simple declarative
statement, and requires of the examinee only that he indicate
whether it is true or false.


Example 


T F From 1950 to 1953 John L. Lewis was the president of the CIO.


Several variations have been introduced in an attempt to improve
the item type. One simple variation is to underline a part of the
statement, CIO in the above example. The instructions indicate
that this is the key part of the statement and that it determines
whether the statement is true or false. That is, the correctness
or appropriateness of the rest of the statement is guaranteed.
The examinee can focus his attention upon the more specific issue of
whether the underlined part is compatible with the rest of the statement.
This seems to reduce guessing and make for more consistent
measurement.

A further variation is to require the examinee to correct the item
if it is false. This works well if combined with the underlining
described above but is likely to be confusing if no constraints are
introduced in the situation. Our example could be corrected by changing
the name of the individual, by changing the dates, or by changing
the name of the organization. Requiring that the item be corrected
minimizes guessing and provides some further cue to the individual's
knowledge.




CAUTIONS IN WRITING TRUE-FALSE ITEMS 
1. Beware of "Specific Determiners," words that give cues to the 
probable answer, such as all, never, usually, etc. Statements that 
contain "all," "always," "no," "never," and such all-inclusive terms 
represent such broad generalizations that they are likely to be false. 
Qualified statements involving such terms as “usually” or "some- 
times” are likely to be true. The test-wise student knows this, and 
will use these cues, if he is given a chance, to get credit for knowl- 
edge he does not possess. "All" or "no" may sometimes be used to 
advantage in true statements, because in this case guessing will lead
the examinee astray. 
Example 


Poor: All unions in the AF of L have always been craft unions. T F
Better: All closed shop contracts require that the workers belong to a union.
T F
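A draft set of true-false statements can be screened for specific determiners before a reviewer reads it. The following is a minimal sketch, not an established procedure: the function name `specific_determiners` is hypothetical, and the two word lists contain only the terms named in the caution above, so a working list would need to be longer.

```python
import re

# Terms named in the caution above; both sets are illustrative, not exhaustive.
ABSOLUTE_TERMS = {"all", "always", "no", "never"}   # tend to mark false statements
QUALIFIED_TERMS = {"usually", "sometimes"}          # tend to mark true statements


def specific_determiners(statement):
    """Return the absolute and qualified terms found in one statement."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return {
        "absolute": sorted(words & ABSOLUTE_TERMS),
        "qualified": sorted(words & QUALIFIED_TERMS),
    }


if __name__ == "__main__":
    poor = "All unions in the AF of L have always been craft unions."
    print(specific_determiners(poor))
```

A flagged statement is not necessarily a poor one. The screen only points the item writer to statements in which the determiner may be doing the work that knowledge should do.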


2. Beware of Ambiguous and Indefinite Terms of Degree or Amount.
Expressions such as "frequently," "greatly," "to a considerable
degree," and "in most cases" are not interpreted in the same way by
everyone who reads them. Ask a class or other group what they
think of when you say that something happens "frequently." Is it
once a week or once an hour? Is it 90 per cent of the time or 50 per
cent? The variation will be very great. (Ed.: How great is very
great?) An item in which the answer depends on the interpretation
of such terms as these is an unsatisfactory one.

3. Beware of Negative Statements and Particularly of Double Negatives.
The negative is likely to be overlooked in hurried reading of
an item, and the double negative is hard to read and confusing.


Example 


Poor: A non-union shop is not one in which an employee must refrain from
joining a union. T F
Better: Employees in a non-union shop are permitted to belong to a union.
T F

4. Beware of Items that Include More than One Idea in the Statement,
Especially If One Is True and the Other Is False. This type of
item borders on the category of trick items. It places a premium on
care and alertness in reading. The reader must not restrict his attention
to one idea to the exclusion of the other or he will be misled.
The item tends to be a measure of reading skills rather than knowledge
or understanding of subject content.




Examples 
According to the Taft-Hartley Act, jurisdictional strikes are forbidden but
the closed shop is approved as an acceptable labor practice. T F
The CIO is composed of industrial unions, whereas the AF of L is composed
entirely of craft unions. T F
(In each of these items the first statement is true; the second statement,
false.)


5. Beware of Giving Cues to the Correct Answer by the Length of the 
Item. There is a general tendency for true statements to be longer 
than false ones. This is a result of the necessity of including qualifica- 
tions and limitations to make the statement true. The item writer 
must be aware of this trend and make a conscious effort to overcome it.


SHORT ANSWER AND COMPLETION ITEMS 

The short-answer and the completion item tend to be very nearly 
the same thing, differing only in the form in which the problem is 
presented. If it is presented as a question it is a short-answer item, 
whereas if it is presented as an incomplete statement it is a com- 
pletion item. 

Example 

Short Answer: Who followed John L. Lewis as president of the CIO?
Completion: John L. Lewis was followed as president of the CIO by (Philip
Murray)
Items of this type are well suited to testing knowledge of vocabulary,
names or dates, identification of concepts, and ability to solve
algebraic or numerical problems. Numerical problems that yield a
specific numerical solution are "short answer" in their very nature.
The measurement of more complex understandings and applications
is difficult to accomplish with items of this type. Furthermore,
evaluation of the varied responses that are given is likely to call for
some skill and to introduce some subjectivity into the scoring
procedure.


MAXIMS CONCERNING COMPLETION ITEMS

1. Beware of Indefinite or "Open" Completion Items. In the first
illustration below, there are many words or phrases that give factually
correct and reasonably sensible completions to the statement:
"a man," "forceful," "beetle-browed," "elected in 1936." The
problem needs to be more fully defined, as is done in the revised
statement.




Example 


Poor: The first chairman of the CIO was (John L. Lewis)
Better: The name of the man who was the first chairman of the CIO is
(John L. Lewis)


2. Don't Leave Too Many Blanks in a Statement; Omit Only Key
Words. Overmutilation of a statement reduces the task of the 
examinee to a guessing game or an intelligence test. 


Example 


Poor: The (Taft-Hartley Act) makes the (closed) shop (illegal)
Better: The closed shop was outlawed by the (Taft-Hartley) Act.


3. Blanks Are Better Put Near the End of a Statement, Rather Than 
at the Beginning. This makes the item more like a normal question. 
The respondent has had the problem defined before he meets the
blank.

Example

Poor: A(n) (injunction) is a court order that forbids workers to strike.
Better: A court order that forbids workers to strike is called a(n)
(injunction)


4. If the Problem Requires a Numerical Answer, Indicate the Units
in Which It Is to Be Expressed. This will simplify the problem of
scoring and will remove one possibility of ambiguity in the examinee's
response.


MULTIPLE-CHOICE ITEMS 


The multiple-choice item is the most flexible and most effective of
the objective item types. It is effective for measuring information,
vocabulary, understandings, application of principles, or ability to
interpret data. In fact, it can be used to test practically any
educational objective that can be measured by a pencil-and-paper test
except the ability to organize and present material. The versatility
and effectiveness of the multiple-choice item are limited only by the
ingenuity and talent of the item writer.

The multiple-choice item consists of two parts: the stem, which
presents the problem, and the list of possible answers or options. 
The stem may be presented in the form of an incomplete statement 
or a question. 




Example 
Incomplete statement: Jurisdictional strikes are illegal under the
A. Taft-Hartley Act.
B. Wagner Act.
C. Walsh-Healy Act.
D. Fair Labor Standards Act.


Question: Which one of the following labor acts outlawed jurisdictional 
strikes? 

A. The Taft-Hartley Act. 

B. The Wagner Act. 

C. The Walsh-Healy Act. 

D. The Fair Labor Standards Act. 

Inexperienced item writers usually find it easier to use the question 
form of stem than the incomplete sentence form. The use of the ques- 
tion forces the item writer to state the problem explicitly. It rules 
out certain types of faults that may creep into the incomplete state- 
ment, which we will consider presently. However, the incomplete 
statement is often more concise and pointed than the question, if it 
is skillfully used. 

The number of options used in the multiple-choice question differs
in different tests, and there is no real reason why it cannot vary for
items in the same test. However, to reduce the guessing factor, it is
preferable to have four or five options for each item. On the other
hand, it is better to have only three good options for an item than
to have five, two of which are so obviously wrong that no one ever
chooses them.
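The effect of the number of options on the guessing factor can be made concrete with a little arithmetic. The sketch below rests on assumptions that the text does not state: the examinee guesses blindly, every option is equally attractive, and the items are independent. Under those assumptions the expected chance score is the number of items divided by the number of options, and the chance of reaching any given score follows the binomial distribution. The function names are hypothetical.

```python
from math import comb


def expected_chance_score(n_items, n_options):
    """Expected number right when every item is answered by blind guessing."""
    return n_items / n_options


def prob_at_least(n_items, n_options, min_correct):
    """Probability of getting at least min_correct items right by blind guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, r) * p**r * (1 - p) ** (n_items - r)
        for r in range(min_correct, n_items + 1)
    )


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        print(k, "options:", expected_chance_score(20, k), "expected right of 20")
```

The same reasoning shows why obviously wrong options add little. If two of five options are never chosen, the examinee is in effect guessing among three, and the item behaves like a three-option item that takes longer to read.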

The difficulty of a multiple-choice item will depend upon the
"closeness" of the options and the process called for in the item.
Consider the set of three items shown below, all centered around
the meaning of "strike" or "jurisdictional strike." One can predict
with a good deal of confidence that I will be passed by more pupils
than II, and II by more than III. The difference between I and II
is in the process involved: I calls for quite direct memory of the
definition of a term, whereas II calls for recognition of the concept
embedded in the complexities of a concrete situation. The difference
between II and III is one of closeness of options: II calls for
rather gross discrimination of major concepts, whereas III calls for
differentiation of subvarieties within a single concept.


I. When the members of a union refuse to work it is called

A. a boycott.
B. an injunction.
C. a lockout.
D. a strike.

II. On a building project the bricklayers were setting up some wooden
platforms to hold their bricks. Then the carpenters refused to work,
claiming that this was work that they should do. This is an example of

A. a boycott.
B. an injunction.
C. a lockout.
D. a strike.

III. On a building project the bricklayers were setting up some wooden
platforms to hold their bricks. Then the carpenters refused to work,
claiming that this was work that they should do. This is an example of

A. a general strike.
B. a jurisdictional strike.
C. a sit-down strike.
D. a sympathy strike.


MAXIMS FOR MULTIPLE-CHOICE ITEMS 


1. The Stem of a Multiple-Choice Item Should Clearly Formulate a
Problem. All the options should be possible answers to a problem
that is raised by the stem. When the stem is phrased as a question,
it is clear that a single problem has been raised, but this should be
equally the case when the stem is in the form of an incomplete
statement. Avoid items that are really a series of unrelated true-false
items dealing with the same general topic.


Example


Poor: The Taft-Hartley Act

A. outlaws the closed shop.
B. prevents unions from participating in politics.
C. is considered unfair by management.
D. has been replaced by the Wagner Act.


Better: The Taft-Hartley Act outlaws the

A. closed shop.
B. preferential shop.
C. union shop.
D. open shop.


2. Include as Much of the Item as Possible in the Stem. In the
interests of economy of space, economy of reading time, and clear
statement of the problem, it is usually desirable to try to word and
arrange the item so that the stem is relatively long and the several
options relatively short. This cannot always be achieved but is an
objective to be worked toward. This principle ties in with the one
previously stated of formulating the problem fully in the stem.


Example 


Poor: Organized labor during the 1920's

A. encountered much unfavorable Federal legislation.
B. was disrupted by internal splits.
C. showed the usual losses associated with a period of prosperity.
D. was weakened by a series of unfavorable court decisions.

Better: During the years from 1920 to 1930 the position of organized labor
was weakened by

A. much unfavorable Federal legislation.
B. splits within labor itself.
C. the effects of business prosperity.
D. a series of unfavorable court decisions.


3. Don't Load the Stem Down with Irrelevant Material. In certain
special cases, the purpose of an item may be to test the examinee's
ability to identify and pick out the essential facts. In this case, it is
appropriate to hide the crucial aspect of the problem in a set of
details that are of no importance. Except for this case, however, the
item should be written so as to make the nature of the problem
posed as clear as possible. The less irrelevant reading the examinee
has to do, the better.
Example

Poor: During the early 1900's employers were generally hostile to organized
labor and used many devices to try to stop their workers from organizing
labor unions. One of these devices was the

A. boycott.
B. black list.
C. closed shop.
D. checkoff.

Better: A device that has sometimes been used by employers to combat the
formation of unions is the

A. boycott.
B. black list.
C. closed shop.
D. checkoff.

4. Be Sure that There Is One and Only One Correct or Clearly Best
Answer. It hardly seems necessary to specify that a multiple-choice
item must have one and only one right answer, but in practice
this is one of the most pervasive and insidious faults in item writing.
Thus, in the following example, though choice A was probably
designed to be the correct answer, there is a large element of
correctness also in choices B and D. The item could be improved as shown
in the revised form.


Example


Poor: The provisions of the Wagner Act (National Labor Relations Act) 
were vigorously criticized by 


A. management. 
B. the AF of L. 
C. the CIO. 
D. Congress. 


Better: The provisions of the Wagner Act (National Labor Relations Act) 
were most vigorously criticized by 


A. the National Association of Manufacturers. 
B. the railroad brotherhoods. 

C. the industrial unions in the CIO. 

D. the Democrats in Congress. 


5. Items Designed to Measure Understandings, Insights, or Ability 
to Apply Principles Should Be Presented in Novel Terms. If the 
situations used to measure understandings follow very closely the 
examples used in text or class, the possibility of a correct answer 
being based on rote memory of what was read or heard is very real. 
The second and third variations of the example on p. 60 illustrate 
an attempt to move away from the form in which the concept was 
originally stated. 

6. Beware of Clang Associations. If the stem and the keyed answer
"sound alike," the examinee may get the question right just by using 
these superficial cues. However, superficial associations in the wrong 
answers represent one of the effective devices for attracting those 
who do not really know the fact or concept being tested. This last 
practice must be used with discretion, or one may prepare trick 
questions. 


Example


Poor: In what major labor group have the unions been organized on an in- 
dustrial basis? 


A. Congress of Industrial Organizations. 
B. Railway Brotherhoods. 

C. American Federation of Labor. 

D. Knights of Labor. 




Better: In what major federation of labor unions would all the workers in a
given company be likely to belong to a single union? 


A. Congress of Industrial Organizations. 
B. Railway Brotherhoods. 
C. American Federation of Labor. 


D. Knights of Labor. 


7. Beware of Irrelevant Grammatical Cues. Be sure that each option
is a grammatically correct completion of the stem. Cues from form
of word ("a" versus "an"), number or tense of verb, etc. must be
excluded. Note, for example, that in the illustration on p. 60 it is
necessary to include the article in each of the separate response
options.

8. Beware of Cues from the Length of the Options. There is a tendency
for the correct answer to be longer than incorrect answers, due
to the need to include the specifications and qualifications that make
it true. Examine your items, and if necessary lengthen some of the
distracters (wrong answers).
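The examination of one's own items suggested here can be done by simple counting. The sketch below is an illustration only: the function names are hypothetical, length is measured crudely in characters, and each item is assumed to be a list of options together with the index of the keyed option. It reports the share of items in which the keyed option is strictly the longest one.

```python
def keyed_is_longest(options, key_index):
    """True if the keyed option is strictly longer than every distracter."""
    key_len = len(options[key_index])
    return all(
        key_len > len(opt) for i, opt in enumerate(options) if i != key_index
    )


def share_keyed_longest(items):
    """Fraction of items whose keyed option is strictly the longest."""
    if not items:
        return 0.0
    hits = sum(keyed_is_longest(options, key) for options, key in items)
    return hits / len(items)


if __name__ == "__main__":
    items = [
        (["1945.", "1946.", "1947.", "1948."], 2),
        (["a boycott.", "an injunction.", "a lockout.", "a strike."], 1),
    ]
    print(share_keyed_longest(items))
```

With four options, a keyed answer that is longest in much more than a quarter of the items suggests that length is serving as a cue, and some distracters may need lengthening.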


THE MATCHING ITEM 

The matching item is actually a special form of the multiple-choice
item. The characteristic that distinguishes it from the ordinary
multiple-choice item is that instead of a single problem or stem with
a group of suggested answers, there are several problems whose
answers must be drawn from a single list of possible answers.
The matching item has most frequently been used to measure
factual information such as the meaning of words, dates of events,
association of authors with titles of books or titles with plot or
characters, names associated with particular events, or association
of chemical symbols with names of chemicals. The matching item
is a compact and efficient way of measuring this type of achievement.

Effective matching items may often be built by basing the set of
items upon a graph, chart, map, diagram, or picture of equipment.
Features of the figure may be labeled, and the examinee may be
asked to match names, functions, etc. with the labels on the figure.
This type of item is particularly useful in tests dealing with science
or technology, e.g., identification of organs in an anatomy test.

However, there are many topics to which the matching item is not
very well adapted. The items making up a set should bear some
relationship to each other; that is, they should be homogeneous. In
the case of many of the outcomes one would like to test, it is difficult
to get enough homogeneous items to make up a set for a matching
item.




Consider the example that appears below. 


Instructions: In the blank in front of the number of each statement in
Column I, place the letter of the word or phrase in Column II that is most
closely related to it.


            Column I                              Column II
(C) 1. Organized by crafts.                   A. Taft-Hartley Act.
(D) 2. A refusal on the part of employees     B. Industrial Revolution.
       to work.                               C. AF of L.
(E) 3. First president of the CIO.            D. Strike.
(B) 4. Resulted in a change of economic       E. John L. Lewis.
       relationship between employer and
       employee.
(A) 5. Outlaws the closed shop.


This example illustrates two of the most common mistakes made 
in preparing matching items. Look at the statements in Column I. 
These statements have nothing in common except that all of them 
refer to labor. The first statement is vague and indefinite in the 
way it is stated but appears to ask for a union organized by crafts. 
Column II includes the name of only one labor organization. Successful
matching here requires almost no knowledge on the part of
the student. Each item in the set can be matched in the same way.
Note, too, that there are only five choices in Column II to match 
the five items in Column I. If the instructions indicate that each 
answer is to be used only once, then the person who knows four of 
the answers automatically gets the fifth by elimination, and the per- 
son who knows three has a fifty-fifty chance on the last two. 
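The elimination argument can be checked by enumerating the ways in which the unused answers might be assigned to the items the examinee does not know. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes that each answer may be used only once and that the examinee assigns the leftover answers blindly.

```python
from fractions import Fraction
from itertools import permutations


def chance_all_remaining_correct(n_items, n_choices, n_known):
    """Probability that blind assignment of the unused answer choices
    gets every remaining item right, when each choice is used only once."""
    unknown = n_items - n_known
    remaining_choices = n_choices - n_known
    # Label the unknown items 0..unknown-1 and let choice i be the right
    # answer for item i; any further choices are surplus distracters.
    arrangements = list(permutations(range(remaining_choices), unknown))
    correct = sum(
        1 for a in arrangements if all(a[i] == i for i in range(unknown))
    )
    return Fraction(correct, len(arrangements))


if __name__ == "__main__":
    print(chance_all_remaining_correct(5, 5, 4))  # five items, five choices
    print(chance_all_remaining_correct(5, 5, 3))
    print(chance_all_remaining_correct(5, 7, 4))  # two surplus choices added
```

With five items and five choices the function returns 1 for the examinee who knows four answers and 1/2 for the one who knows three, which agrees with the text. Adding two surplus choices lowers the first figure to 1/3, which is the reason for providing more answer choices than problems.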


MAXIMS ON MATCHING ITEMS 


1. When Writing Matching Items, the Items in a Set Should Be 
Homogeneous. For example, they should all be names of labor leaders, 
or all dates of labor legislation, or all provisions of different con- 
gressional acts. 

2. The Number of Answer Choices Should Be Greater Than the Num- 
ber of Problems Presented. This holds except when each answer 
choice is used repeatedly, as in variations that we shall consider 
presently. 

3. The Set of Items Should Be Relatively Short. It is better to
make several relatively short matching sets than one long one because
(1) it is easier to keep the items in the set homogeneous and
(2) it is easier for the student to find and record the answer.




4. Response Options Should Be Arranged in a Logical Order, if One
Exists. Arranging names in alphabetical order or dates in chronological
order reduces the clerical task for the examinee.

5. The Directions Should Specify the Basis for Matching and Should
Indicate Whether an Answer Choice May Be Used More Than Once.
These precautions will guarantee a more uniform task for all
examinees.


A variation on the matching type of item which is sometimes
effective is the classification type or master list. This pattern
presents an efficient means of exploring range of mastery of a concept
or related set of concepts.


Example 
Below are given some newspaper reports about actions taken by employees. 
For each of these, you are to mark the action 
A if it is an ordinary strike. 
B if it is a sit-down strike. 
C if it is a jurisdictional strike. 
D if it is a sympathetic strike. 
E if it is none of the above. 
(8) 1. At 12 noon yesterday all the government employees in France
       stopped work for 24 hours.
(B) 2. At 10:30 this morning all the workers in the Forman plant
       shut down their machines. They refused to leave the plant.
(C) 3. An election at the electric appliance factory failed to give a
       majority to either the CIO or AF of L. None of the workers
       reported for work today.
(D) 4. Truck drivers quit today when ordered to haul freight from
       docks on which the longshoremen were on strike.


The above question can be made to yield further information about
the examinees by requiring each student to classify each action
further by the code P, permitted by the Taft-Hartley Act; F, forbidden
by the Taft-Hartley Act; or U, there is some doubt whether the
Taft-Hartley Act permits this action or not. In this instance, the
additional task would probably also substantially increase the
difficulty of the item.

Another setting in which the master list variation of the classification
type of item can often be used to advantage is that of testing
knowledge of the general chronology or sequence of events. An
example is shown here.




Example

For each event on the left, pick the choice on the right that tells when the
event took place.

        Event                                      Time
(A) 1. Beginnings of Industrial Revolution.     A. Before 1820.
(D) 2. Formation of CIO.                        B. Between 1820 and 1860.
(C) 3. Formation of Knights of Labor.           C. Between 1860 and 1915.
(E) 4. Taft-Hartley Act.                        D. Between 1915 and 1940.
(C) 5. The great Pullman strike                 E. Since 1940.
       (and possibly others).


There are a number of other varieties of objective test items that
have been developed and used to some extent. The reader who is 
interested in a survey of these, together with a more extended dis- 
cussion of teacher-made tests in general, is referred to the suggested 
additional readings at the end of the chapter. 


TESTING FOR UNDERSTANDING 


Since it is easier to construct questions testing factual knowledge
than those that measure understanding, application of principles, and 
other meaningful outcomes of instruction, teacher-made tests, espe- 
cially of the objective type, tend to emphasize facts. Teachers tend 
to assume that if a student knows the factual material, then he also 
understands that material. Although there is a positive relationship 
between factual knowledge and understanding, the relationship is
not perfect. It is true that in order for the student to understand a 
principle, he must have the relevant facts and basic skills. But 
there is no assurance that mere possession of the facts means that 
the student really understands the material. 

If students are to develop understandings, understandings must 
be taught and they must be evaluated. In the measurement of 
understanding, the situations or applications used in evaluation 
should be similar to, but not identical with, the examples used in 
class. If the same situations are used, the student may get the
correct answer because he has memorized the example given in class,
not because he understands the principle.

Objective test items do not divide up into two clearly distinct
groups, those that measure factual knowledge and those that measure
understanding, application, or interpretation. Many items involve
understanding and application at various levels as well as the
underlying factual knowledge. Thus, illustration III on p. 60 and
the matching item on p. 65 both call for applications of knowledge
about strikes to new situations. Multiple-choice items in particular
readily lend themselves to testing the understanding and application
of principles with novel material or in novel settings.

Another type of item is the interpretive type item. This type of
item consists of an introductory selection of material, giving the
necessary background and setting the problem, followed by a series of
questions asking for interpretations of the material. The introductory
material can be text, graphs, tables, maps, charts, or any similar
material. It can be complete in itself, providing all the necessary
information basic to the understanding, or it can be incomplete so
that the student must know certain things in addition to those given.

Two examples of the interpretive test exercise are given. The first
is based on a current newspaper item, and the student is not given
all the essential information but must know certain facts about the
Taft-Hartley Act in order to answer the question.

The second item is based on a graph showing certain data about
union membership, strikes, and important social and economic
events. In this item, the accuracy of the student's answer depends
only upon his ability to understand the material as it is presented
to him in graphic form.


Example I

Radio station WKRX uses only recorded music on its programs. A
contract between the radio station and the musicians' union had required the
station to hire a certain number of musicians, even though the musicians
never played on any programs. When the contract ended, the radio station
refused to renew it. Members of the musicians' union started to picket the
radio station headquarters to force it to renew the contract. When the
baseball season started, members of the union began to picket the local
baseball park because the ball games of the local team were broadcast over
station WKRX. The owners of the ball team and of the radio station took the
case to court and asked the court to rule whether the picketing was legal.

What was the most probable ruling by the court?

A. Only the picket line at the baseball park was legal.
B. Only the picket line at the radio station was legal.
C. Both picket lines were legal.
D. Neither of the picket lines was legal.

From the statements below check all that support your answer.

___ 1. Workers cannot be prevented by management from using any
peaceful method of protecting their jobs.

___ 2. The Taft-Hartley Act permits strikes when other means of settling
disputes fail.

___ 3. Secondary boycotts are forbidden by the Taft-Hartley Act.

___ 4. "Featherbedding" practices by unions are forbidden under the
Taft-Hartley Act.

___ 5. Since the picketing of the baseball park was against the radio
station and not against the baseball team, the owners of the
baseball team had no grounds for court action.

___ 6. Since baseball is a sport, not a business, a baseball park cannot
be used to force the settlement of a dispute between labor and
management.

___ 7. Strikes cannot be called against an employer who does not have a
contract with a union.


Example II

[Graph, 1915 to 1950: number of workers unemployed, number of workers
unionized, and number of workers on strike, with important social and
economic events marked along the time axis.]

Fig. 4.1. Factors relating to labor organization and labor conflict.


The following statements refer to Fig. 4.1. Read each statement carefully.
In front of each statement mark

A if the statement is supported by the evidence in Fig. 4.1.
B if the statement is contradicted by the evidence in Fig. 4.1.
C if the statement is neither supported nor contradicted by the evidence
in Fig. 4.1.

(B) 1. Bad economic conditions tend to produce large numbers of
strikes.
(C) 2. The "New Deal" encouraged workers to join unions.
(A) 3. The number of strikes increases after a war.
(B) 4. The passage of the Taft-Hartley Act caused a drop in union
membership.
(C) 5. By 1950, the majority of skilled and semiskilled workers in
industry belonged to unions.
(B) 6. As the number of unemployed workers increases, membership
in unions increases.
(C) 7. The large number of men on strike in 1945 caused the passage
of the Taft-Hartley Act in 1947.
( ) 8. The period between 1920 and 1930 was marked by a steady
decrease in union membership.
( ) 9. The establishment of the CIO was followed by an increase in
union membership.
(A) 10. The pattern of number of workers out on strike is similar to
that of the number of workers belonging to unions.


The interpretive type of item provides an opportunity to ask 
meaningful questions about complex data in order to evaluate the 
student's ability to understand and interpret materials. 

However, this item type presents special problems. The introductory
material must be carefully chosen to elicit the type of understanding
that the teacher desires. Although a number of sources such as
newspapers, magazines, or books can be used to furnish material for
the introductory materials, it usually has to be rewritten and adapted
by the teacher to keep it at an appropriate reading level and to
eliminate unnecessary parts. The success of this type of item is
dependent to a large extent upon the adequacy of the introductory
material.

Another disadvantage of the interpretive type of item is the
reading load. Most of these items tend to be long, so that the
evaluation of understanding will be contaminated by the reading
level of the student.

A third disadvantage is the amount of space required to present
the item and the amount of time required to answer it. With this
type of item it is not possible to get as many different units of
coverage as with the usual type of multiple-choice item.

For a more detailed discussion and for more examples of methods
of measuring understanding in the different subject-matter fields,
the reader is referred to the Forty-Fifth Yearbook of the National
Society for the Study of Education, listed in the supplementary
readings at the end of the chapter.


GETTING THE OBJECTIVE TEST READY FOR USE


So far we have considered the problems involved in improving
the quality of the individual objective test items. Now we must
give some thought to putting the items together into a test that is
an effective whole. The quality of the total test will have been
determined in large measure by the quality of our initial planning
and by the skill with which we have written the separate test items.
However, some further suggestions may help in achieving a sound
and workmanlike product.


EXTRA ITEMS 

When the items are originally written it will usually pay to write
a surplus over the number that will finally be used. Items that
seem masterworks in the first pride of authorship may show
unsuspected flaws when coldly re-examined at a later date. Furthermore,
some freedom for fitting the final test to the specifications of the
blueprint is often helpful. A surplus of 20 or 30 per cent is none too
much.


REVIEW AND EDITING 

It is always sound policy, if time permits, to write the items early
and put them aside for a while. When reread later, ambiguities will
appear that were not seen at all when the item was first written.
Even more helpful, if it is feasible, is to get another person who
knows the subject matter to go over the items, keying them and
criticizing them. This type of review will usually bring out a rather
startling number of points of ambiguity or disagreement. Revision
of the items in the light of such a critique or elimination of items that
seem not to be salvageable will do much to avoid those debates with
students and those ill-feelings that are an occasional feature of
objective examinations.

FORM OF REPRODUCTION 

Though it is possible to give objective examinations orally, it is
far from satisfactory to do so. It is demanding upon students'
concentration and introduces an element of speed pressure that is quite
disturbing to some. It is generally assumed that an objective test
will be reproduced and that each pupil will have a copy. Gelatin
duplicating processes are adequate for groups of moderate size, but
most test makers will prefer to mimeograph the test if facilities for
mimeographing are available. More important than the process is
the quality of the work, both in organizing the layout of the test
and in typing up the master copy.


ORDER AND GROUPING OF TEST ITEMS 


After the items have been edited and those to be included in the
test have finally been selected, they must be arranged in the order
in which they are to appear in the test. There are three aspects
that should be considered and reconciled as far as possible in deciding
upon the arrangement and grouping of items.


1. Items in the same form (true-false, multiple-choice, etc.) should
be grouped together, so that instructions for answering will carry
throughout the set.

2. In general, an attempt should be made to progress from easy
to more difficult items. This is especially important with younger
children, who may become discouraged and quit if the early items
are too difficult. It is also important if time is likely to be limited,
so that some items will not be reached. These not-attempted items
should be the more difficult ones that the examinee would not have
been likely to answer correctly even if he had reached them.

3. Items dealing with similar content can well be grouped together.
If this is done, it will help to reduce the feeling that the test is made
up of unrelated bits and pieces. It will encourage a more integrated
attack by the examinee.
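
The three considerations above can be read as a sort order. The sketch below is only one illustrative way of reconciling them, not a procedure given in the text. It assumes that each item carries a form, a content label, and the teacher's rough estimate of difficulty; the field names and the sample items are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    form: str          # e.g. "true-false" or "multiple-choice"
    content: str       # topic the item deals with
    difficulty: float  # teacher's estimate; larger means harder


def arrange(items):
    """Group by form, then by content, then run from easy to hard."""
    form_rank, content_rank = {}, {}
    for it in items:
        # Forms and topics keep the order in which they first appear.
        form_rank.setdefault(it.form, len(form_rank))
        content_rank.setdefault(it.content, len(content_rank))
    return sorted(
        items,
        key=lambda it: (form_rank[it.form], content_rank[it.content], it.difficulty),
    )


items = [
    Item("Q1", "multiple-choice", "strikes", 0.6),
    Item("Q2", "true-false", "unions", 0.3),
    Item("Q3", "multiple-choice", "unions", 0.2),
    Item("Q4", "true-false", "strikes", 0.5),
    Item("Q5", "multiple-choice", "strikes", 0.1),
    Item("Q6", "true-false", "unions", 0.1),
]
ordered = [it.text for it in arrange(items)]
print(ordered)
```

In this arrangement the easy-to-hard progression holds only within each group of like items, which is one possible compromise among the three aspects.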


DIRECTIONS 

Clear instructions to the examinees are an important element in a
well-constructed test. Examinees will usually know the purpose of
a test, but if it is possible that they may not, the purpose should be
stated. Complete instructions should be given as to how the pupil
is to record his answers. This is particularly important for novel or
unusual item patterns. The examinee should be given explicit
information as to the scoring procedure that will be used, including
the credit for each item or part and whether or not a correction will
be made for guessing. (See Scoring.)

Sample sets of directions for matching items have been given on
pp. 65 and 66.

For a test made up of multiple-choice items that will not be
corrected for guessing and for which separate answer sheets are used,
one might use the following set of directions.


Directions:

Read each item and decide which choice best completes the statement or
answers the question.

Mark your answers on the separate answer sheet. Do not mark them on the
test booklet. Indicate your answer by blacking out on the answer sheet the
letter corresponding to your choice. That is, if you think that choice B is
the best answer to item 1, black out the B in the row after No. 1 on your
answer sheet.

Your score will be the number of right answers, so it will be to your
advantage to answer every question, even if you are not sure of the right
answer.

Be sure your name is on your answer sheet.


74 PREPARING OBJECTIVE TESTS 


is the number of right answers minus the number of wrong answers.
Thus, if there were 75 true-false items on a test and a student got
48 right, got 20 wrong, and did not answer 7 of them, his score would
be 48 - 20 or 28. Note that omits do not count in this formula for
guessing.

For a second example, suppose a student took a 60-item
multiple-choice test in which each item had 5 possible answers. If he
got 52 questions right and 8 wrong, his corrected score would be
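
The arithmetic of both examples can be checked with a short sketch. It assumes the conventional correction formula, rights minus wrongs divided by one less than the number of choices per item. That assumption is consistent with the true-false case above, where two choices make the score simply rights minus wrongs. The function name is illustrative.

```python
def corrected_score(right, wrong, choices):
    """Rights minus wrongs / (choices - 1); omitted items are ignored."""
    return right - wrong / (choices - 1)


# 75 true-false items: 48 right, 20 wrong, 7 omitted.
true_false = corrected_score(48, 20, 2)

# 60 five-choice items: 52 right, 8 wrong.
five_choice = corrected_score(52, 8, 5)

print(true_false, five_choice)
```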


ANALYZING AND USING THE RESULTS OF 
OBJECTIVE TESTS 


Giving the test, scoring it, and recording a score for each pupil
frequently ends the matter as far as the teacher is concerned.
However, if the teacher drops the test at this point, he loses much of its
value. An analysis of the responses of the pupils to the items can
serve two important purposes. In the first place, the test results
provide a diagnostic technique for studying the learnings of the class
and the failures to learn and for guiding further teaching and study.
In the second place, the responses of pupils to the separate items
and a review of the items in the light of these responses provide a
basis for preparing better tests another year.

The basic analysis that is needed is a tabulation of the responses 
that have been made to each item on the test. We need to know 
how many pupils got each item right, how many chose each of the 
possible wrong answers, and how many omitted the item. It helps 
our understanding of the item if we have this information for the 
upper and lower fractions of the group, and perhaps also for those 
in the middle. From this type of tabulation, we can answer such 
questions as the following for each item:


1. How hard is the item? 

2. Does it distinguish between the better and poorer students? 

3. Do all the options attract responses, or are there some that are 
so unattractive that they might as well not be included? 
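
A tabulation of this kind can be sketched in a few lines. The example below is a hypothetical illustration, not the authors' form: it assumes each paper is recorded as a total score together with the response to one item, takes the top and bottom quarter by total score, and counts the choices in each group. The data, the key, and the function name are invented for the purpose.

```python
from collections import Counter


def tally_item(papers, options, fraction=0.25):
    """papers: list of (total_score, response); a response of None is an omit."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    groups = {"upper": ranked[:k], "lower": ranked[-k:]}
    table = {}
    for name, group in groups.items():
        counts = Counter("omit" if r is None else r for _, r in group)
        table[name] = {opt: counts.get(opt, 0) for opt in list(options) + ["omit"]}
    return table


# Eight hypothetical papers; the keyed answer is assumed to be A.
papers = [(40, "A"), (38, "A"), (35, "C"), (33, "A"),
          (30, "C"), (27, "B"), (22, "C"), (18, None)]
key = "A"
table = tally_item(papers, "ABCD")

# 1. How hard is the item? Count of right answers in the two groups.
right = table["upper"][key] + table["lower"][key]
# 2. Does it distinguish the better from the poorer students?
discriminates = table["upper"][key] > table["lower"][key]
# 3. Which options attracted nobody in either group?
unused = [o for o in "ABCD" if table["upper"][o] + table["lower"][o] == 0]

print(table, right, discriminates, unused)
```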


A simple form can be prepared for recording the responses to each
item, like that shown in Fig. 4.3. This can be put on a separate
card for each item, and then the information can be accumulated in
a permanent item file. This form is planned for a multiple-choice
item with as many as five choices but can be used for true-false
items by using only the A and B columns.

To illustrate the type of information that is provided by an item
analysis, certain items from a current-events test are presented below,
together with the analysis of responses for each item. This test was
used in 1950 and 1951 with high-school juniors and seniors. The
count of responses shown is based on 25 cases from the top and 25
cases from the bottom of a group of 100. Top and bottom are
defined by total score on the test. That is, test papers were arranged
from high to low on total score, and the first and last 25 were tallied.
Frequency of choosing the different options is shown on the right.
The correct option is underlined. After each item is presented, the
meaning of the item data is discussed briefly.


Item: The Taft-Hartley Act outlaws the

A. closed shop.
B. open shop.
C. preferential shop.
D. union shop.

Option
A B C D E Omit
Upper 25% 10
Middle 50% 17 1 2
Lower 2 5 1 1 8

Fig. 4.3. Form for recording item-analysis data.


Item I

In movies such as "Lost Boundaries" and "Home of the Brave,"
Hollywood made a determined attack upon

                          Upper   Lower
                            25      25
A. communism.                0
B. isolationism.             0
C. racial prejudice.        25      22
D. fascism.                  0
   (Omit)                    0




This is a very easy item, since all 25 in the upper group and 22 in 
the lower group get it right. However, it does differentiate in the 
desired direction, since what errors there are fall in the lower group. 
Two or three easy items like this would be good "ice breakers" to 
start a test. 

Item II

In most of the strikes in major U. S. industries during 1950, the main
point of dispute was

                          Upper   Lower
                            25      25
A. pensions.                13       1
B. union security.           0       1
C. wages.                   11      19
D. hours.                    1       2
E. working conditions.       0       1
   (Omit)                    0       0


At the time that it was used, this was a hard item but a very
effective one. (Of course, its usefulness would change rapidly with
the passage of time, since current events do not remain current.) That
it is hard is shown by the fact that only 14 out of 50 got it right.
That it is effective is shown by the fact that 13 of the 14 were in the
upper group. All the wrong options attracted some choices in the
lower group, and they all attracted more of the lower group than
the higher group. Incidentally, an item such as this shows how
faulty the idea of "blind guessing" often is when an item is
effectively written. In this item, the vast majority of the lower group
concentrated upon one particular wrong option that was particularly
plausible and appealing.

Item III

The term "parity" is prominent in price discussions concerning

                          Upper   Lower
                            25      25
A. manufacturing.            7       6
B. mining.                  10       4
C. agriculture.              7       9
D. transportation.           0       2
   (Omit)                    1       4


This item turned out poorly. Only 16 out of 50 got it right, and
right answers were more frequent in the lower than in the upper
group. As far as the test is concerned, it appears that this item
would have to be either discarded or radically revised. If the class
was supposed to have learned about parity, this shows that
the learning did not take place. Their responses indicate a need to
explore with the group the nature of their confusion and to give
additional attention to clarifying the topic. One can speculate
about the reason for the popularity of option B in the upper group.
It may have been due to the fact that coal strikes and coal prices
had occupied a fairly prominent position in the news in the preceding
months and were uppermost in the thoughts of the respondents.


Item IV

In the latter part of 1949 the legal minimum hourly wage for workers
under Federal jurisdiction was changed to

                          Upper   Lower
                            25      25
A.                           0       0
B.                          23      17
C.                           2       7
D.                           0       0
   (Omit)                    0       1


This item shows some discrimination in the desired direction (23
versus 17), but the differentiation is not very sharp. The item is of
a familiar sort. Only two of the four choices are functioning at all.
Nobody selects either the A or D choice. If we wished to use this
item again, we might try using $0.60, $0.75, $0.90, and $1.00 for
the four choices, hoping in this way to make the item both more
difficult and more discriminating.
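
The counts of right answers reported for Items I to IV can be summarized with two simple figures: the proportion of the 50 tallied pupils who answered correctly, and the difference between the upper and lower groups expressed as a fraction of the group size. These two indices are a common convention rather than a formula given in the text, and the layout of the data below is illustrative.

```python
GROUP_SIZE = 25  # papers tallied in each of the upper and lower groups

# (right answers in upper group, right answers in lower group)
right_answers = {"I": (25, 22), "II": (13, 1), "III": (7, 9), "IV": (23, 17)}


def summarize(upper, lower, n=GROUP_SIZE):
    proportion_right = (upper + lower) / (2 * n)  # lower value = harder item
    difference = (upper - lower) / n              # negative = works backwards
    return proportion_right, difference


summary = {item: summarize(u, l) for item, (u, l) in right_answers.items()}
for item, (p, d) in summary.items():
    print(f"Item {item}: proportion right {p:.2f}, upper-lower difference {d:+.2f}")
```

On these figures Item II is both the hardest and the sharpest of the four, and Item III is the only one with a negative difference, which agrees with the discussion of each item above.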


Item statistics such as these can be used not only for evaluating
the items but to guide review and restudy of the material with a
class. The items that prove difficult for the class as a whole provide
leads for further exploration. Discussion of these items with the
class will throw light on the nature of the misunderstanding. The
misunderstanding may in some cases be cleared up by brief further
discussion, although in some cases a fuller review of the topic may
be indicated. It is desirable, if local policies permit, to let pupils
have their answer sheets and a copy of the test and to make the
answer key available to them so that they can themselves use the
test as a guide to review and clarification of the points they missed.
An examination should teach as well as test.




SUMMARY STATEMENT 


The deficiencies of essay examinations have led to the preparation
of tests made up of objective short-answer questions. These
questions may be prepared in true-false, completion, multiple-choice,
matching, and many other forms. Experience of item writers has
led to the formulation of a number of "do's" and "don't's" to guide
the preparation of test items. These are considered in detail in this
chapter.

Though there is an unfortunate tendency for writers of objective
items to concentrate on factual information, ability to understand,
interpret, and apply can be tested by items that follow this format.
For the measurement of understanding it is often desirable to
describe a fairly complex problem situation or to present a fairly full
set of data and to organize a set of related questions about the
problem or data. Illustrations are provided.

It helps, in producing a good test, to prepare extra items and to
have the items edited and screened before using. Items should be 
grouped so as to emphasize relationships and to provide a general 
progression from easy to more difficult. Answer sheets and scoring 
stencils facilitate scoring. The issue of correction for guessing should 
be resolved in advance, and examinees should be told what procedure 
will apply. 

Test results can be analyzed with profit to guide (1) further teach- 
ing and review and (2) the construction of additional tests in later 
years. 


SUGGESTED ADDITIONAL READING 


Bean, Kenneth L., Construction of educational and personnel tests, New
York, McGraw-Hill, 1953.

Ebel, Robert L., Writing the test item, Chapter 7 in E. F. Lindquist,
Editor, Educational measurement, Washington, D. C., American Council on
Education, 1951.

Micheels, William J., and M. Ray Karnes, Measuring educational achievement,
New York, McGraw-Hill, 1950.

National Society for the Study of Education, The measurement of
understanding, The Forty-fifth Yearbook, Part I, Chicago, Illinois,
University of Chicago Press, 1946.

Odell, C. W., How to improve classroom testing, Dubuque, Iowa, William C.
Brown, 1953.

Travers, Robert M. W., How to make achievement tests, New York, Odyssey
Press, 1950.

Traxler, Arthur E., Administering and scoring the objective test, Chapter
10 in E. F. Lindquist, Editor, Educational measurement, Washington, D. C.,
American Council on Education, 1951.

Weitzman, Ellis, and Walter J. McNamara, Constructing classroom
examinations: a guide for teachers, Chicago, Illinois, Science Research
Associates, 1949.


QUESTIONS FOR DISCUSSION 


1. In a junior high school, one teacher takes complete responsibility for
preparing the common final examination for all the classes in general science.
He makes the examination up without consulting the other teachers. What
advantages and disadvantages do you see in this procedure?

2. A high-school principal has a system of using a different type of
objective test item each month: one month it is true-false, the next month
multiple-choice, the next month completion, and so on. Each teacher is
expected to follow this uniform pattern. How would you evaluate this
procedure? Why?

3. What steps can a teacher take to avoid ambiguous items on a test?

4. Under what conditions would it be important to correct scores on an
objective test for guessing?

5. Collect some examples of poor items you have seen on tests. Indicate
what is wrong with each item.

6. Construct four multiple-choice items designed to measure understanding
or application.

7. Prepare a short objective test for a small unit that you are teaching or
plan to teach. Indicate for each item the objectives that you are trying to
evaluate with that item.

8. What are the arguments for and against returning major examination
papers to students?

9. A fourth-grade teacher has given a test in arithmetic. What analyses
of the results could the teacher make that would help guide (a) future work
for the class as a whole and (b) special assistance given to individual pupils?

10. A college teacher has given an objective test to a large class, scored
the papers, and entered the scores in the class record book. What further
steps might the teacher take before returning the papers to the students?
Why?

Chapter 5 


Elementary Statistical 
Concepts 


INTRODUCTION 


In its various forms, measurement results in classification,
rankings, or scores. Any attempt to describe, summarize, or compare
results for individuals or for groups calls for numerical treatment.
The branch of arithmetic and mathematics that deals with the
analysis of sets of scores for groups of individuals is known as statistics.
Every user of tests and measurement devices needs at least a
consumer's understanding of the basic objectives and techniques of
descriptive statistics. This is a book on measurement, not a
statistics textbook. Discussion of statistics as such is limited to this
one chapter. It cannot be expected that study of it will make the
reader an accomplished statistician. This chapter tries to point
out to the novice the basic types of questions that the statistician
tries to answer and the simplest tools that he uses to answer some of
them.

Suppose you have prepared tests in reading, arithmetic, and
spelling and given them to the pupils in two sixth grades in your school.
You have scored the papers and entered the names and scores on
a record sheet for the two classes. Table 5.1 shows the way the
record sheet might look. Now, what sorts of questions might you ask
these data? That is, to what questions might you ask the data to
provide the answers? Before reading further, suppose you study
the set of scores and jot down on a piece of scrap paper the
questions that come to your mind in connection with these scores. See
how many of the question types you can anticipate.


Table 5.1. Record Sheet for Sixth Grades at School X

                            Test Scores
    Name          Reading  Arithmetic  Spelling
 1. Carol A.           32           3        26
 2. Mary B.            27          27        23
 3. Ruby C.            31           9        29
 4. Alice D.           36          18        27
 5. Theresa E.         47          21        35
 6. Ida F.             42          24        26
 7. Vivian G.          22           4        17
 8. Grace H.           50          42
 9. Opal I.                        18
10. Ursula J.                       2
11. Beatrice K.                    10
12. Karen L.                       13
13. Susan M.                       20
14. Jane N.                        15
15. Dorothy O.                     19
16. Frances P.                      2
17. Elizabeth Q.                   48
18. Pearl R.                       41
19. Joan S.                        41
20. Nancy T.                       40
21. Judith U.                      24
22. Edith V.                       24
23. Louise W.                      18
24. Helen X.                       12
25. Martha Y.                      26
26. Doris Z.                       12
27. James A.                       29
28. Albert B.                      16
29. Donald C.                       7
30. Peter D.                       29
31. Samuel E.                      36
32. George F.                      10
33. Roger G.                       14
34. Newton H.                      18
35. Karl I.                        12
36. Isidore J.                     30
37. John K.                         9
38. Benjamin L.                    15
39. Theodore M.                    38
40. Michael N.                     20
41. Herman O.                      15
42. Charles P.                     39
43. Patrick Q.                     33
44. William R.                      6
45. Martin S.                      26
46. Frank T.                       20
47. Ralph U.                       20
48. Thomas V.                      29
49. Henry W.                       25
50. Oscar X.                       19
51. Edward Y.                      19
52. Leonard Z.                     19


A first, rather general question you might ask is: What is the
general pattern of the set of scores? How do they "run"? What do
they "look like"? How can we picture the set of reading scores, for
example, so that we can get an impression of the group as a whole?
To answer this question we will need to consider simple ways of
tabulating and graphing a set of scores.

A second type of question that will almost certainly arise is:
What is this group like, on the average? Have they done as well on
the test as other sixth-grade groups? Are they ready for the regular
sixth-grade instruction and material? What is the typical level of
performance in the group? All these questions call for some single
score to represent the group as a whole, some measure of the middle
of the group. To answer this question we shall need to become
acquainted with statistics developed to represent the average or
typical score.

Third, in order to describe your group you might feel a need to
describe the extent to which the scores spread out away from the
middle value. Are all the children in the group about the same,
so that the same materials and procedures would be suitable for all?
If not, how widely do they spread out on a given test? How does
this group compare with other classes with respect to the spread of
scores?

Fourth, you might ask how a particular individual stands on a
particular test. Thus, you might want to know whether James A.
had done well or poorly on the arithmetic test, and if you decided
that his score was a good score you might want some way of saying
just how good it was. You might ask whether James A. did better
in reading or in arithmetic. To answer this question we need a
common yardstick in terms of which to express performance in two quite
different areas. Our need, then, is for some uniform way of
expressing and interpreting the performance of an individual. How does
he stand, relative to his group?

A fifth query is of this type: To what extent did those who
excelled in reading also excel in arithmetic? To what extent do these
two abilities go together in the same individuals? Is the individual
who is superior in one likely also to be superior in the other? The
measures of this going-togetherness we speak of as measures of
correlation.
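
As a rough preview of the second through fifth questions, the sketch below computes one familiar statistic for each. It is an illustration only: it uses the reading and spelling scores of the first seven pupils as printed in Table 5.1 (spelling stands in for arithmetic in the correlation), and the choice of statistics and the helper names are assumptions, not the procedures developed later in the chapter.

```python
import statistics

# First seven pupils of Table 5.1, as printed.
reading = [32, 27, 31, 36, 47, 42, 22]
spelling = [26, 23, 29, 27, 35, 26, 17]

# Second question: a single score that represents the group.
mean_reading = statistics.mean(reading)
median_reading = statistics.median(reading)

# Third question: how far the scores spread out.
score_range = max(reading) - min(reading)
sd_reading = statistics.pstdev(reading)


# Fourth question: where one individual stands relative to the group.
def percent_below(score, scores):
    return 100 * sum(s < score for s in scores) / len(scores)


# Fifth question: how closely two sets of scores go together.
def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


standing = percent_below(47, reading)
r = pearson(reading, spelling)
print(mean_reading, median_reading, score_range, sd_reading, standing, r)
```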

The following sections of this chapter will be devoted to
illustrating and discussing the routines that statistics has developed for
answering these questions. There are many more questions that
arise with respect to data. The most important ones concern the
drawing of general conclusions from a set of data. Thus, one sample
of 50 boys may have surpassed a sample of 50 girls from the same
school on a history test. This is a descriptive fact true of these
particular groups. We would like to know whether we can safely
conclude that the total population of boys from which this sample was
drawn would surpass the total population of girls on this same test.
This is a problem of inference. Problems of statistical inference
make up the bulk of advanced statistical work, but we cannot go
into them here.


WAYS OF TABULATING AND PICTURING A 
SET OF SCORES 

In Table 5.1 we showed a record sheet on which test scores for
52 sixth-grade pupils had been recorded. Let us look at the scores
in the column headed Reading and consider how they could be
rearranged so as to give us a clearer picture of how the pupils did on
the reading test.

The simplest rearrangement would be to just arrange them in
order from highest to lowest. We might then have something that
looked like this:

59   43   37   34   30   22
56   42   37   33   29   22
52   42   36   32   29   21
50   40   36   32   28   21
50   39   36   32   27   20
47   38   36   31   27   17
46   38   35   31   27   17
44   38   35   31   25
43   37   34   30   24


This arrangement gives a somewhat better picture of the way the
scores fall. We can see the highest and lowest scores at a glance,
i.e., 59 and 17. It is also easy to see that the middle person in the
group falls somewhere in the mid-thirties. We can see by inspection
that roughly half the scores fall between 30 and 40. But this simple
rearrangement of scores still has too much detail for us to see the
general pattern clearly. It is also not a convenient form to use in
computing. We need to condense it into a more compact form.


PREPARING A FREQUENCY DISTRIBUTION 

A further step in organizing the scores for presentation is to
prepare what is termed a frequency distribution. This is a table showing
how often each score occurred. Each score value is listed, and the
number of times it occurred is shown. A portion of the frequency
distribution for the reading scores is shown in Table 5.2.


Table 5.2. Frequency Distribution of Reading Scores
(Ungrouped Data)

Test Score    Frequency
    59            1
    58            0
    57            0
    56            1
    55            0
    54            0
    53            0
    52            1
    51            0
    50            2
    ...          ...
    20            1
    19            0
    18            0
    17            2


However, Table 5.2 is still not a very good form for reporting our
facts. The table will be too long and spread out. We have shown only
part of it. The whole table would take 43 lines. It would have a number
of zero entries. There would be marked ups and downs from one
score to the next.

In order to improve the form of presentation further, scores are
often grouped together into broader categories. In our example, we
will group together three adjacent scores, so that each grouping
includes three points of score. When we do this, our set of scores is
represented as shown in Table 5.3. This provides a fairly compact
table showing how many scores we have in each group or class
interval. Thus, we have eight scores in the interval 34-36. We do not
know how many of them are 34's, how many 35's, and how many
36's. We have lost this information in the grouping. We assume
that they are evenly divided. In most cases, when there is no reason
to anticipate that any one score will occur more often than any
other, this assumption is a sound one, and the gain in compactness
and convenience of presentation more than makes up for any slight
inaccuracy introduced by this grouping.*

Table 5.3. Frequency Distribution of Reading Scores
(Grouped Data)

Score Interval    Tallies       Frequency
    58-60         |                 1
    55-57         |                 1
    52-54         |                 1
    49-51         ||                2
    46-48         ||                2
    43-45         |||               3
    40-42         |||               3
    37-39         ||||/ ||          7
    34-36         ||||/ |||         8
    31-33         ||||/ ||          7
    28-30         ||||/             5
    25-27         ||||              4
    22-24         |||               3
    19-21         |||               3
    16-18         ||                2


In a practical situation, we always face the problem of deciding
how broad groupings should be, i.e., whether to group by 3's, 5's,
10's, or some other grouping. The decision is a compromise between
losing detail from our data, on the one hand, and obtaining a
convenient, compact, and smooth representation of our results, on the
other. A broader interval loses more detail but condenses the data
into a more compact picture. A practical rule-of-thumb is to choose
a class interval that will divide the total score range into roughly
15 groups.

Thus, in our example the highest score was 59 and the lowest
was 17. The range of scores is 59 - 17 = 42. Dividing 42 by 15,
we get 2.8. The nearest whole number is 3, and so we group our
data by 3's. In addition to the "rule of 15," we also find that 5, 10,
and multiples of 10 are convenient groupings. Since the purpose
of grouping scores is to make a convenient representation, factors
of convenience enter as a major consideration.

* In some special types of social statistics, such as reports of income, certain
values are more likely than others, i.e., $2000, $3000, $5000, etc. Special
precautions are necessary in grouping material of this type. In particular, one
should strive to get popular values near the middle of a class interval.
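The "rule of 15" can be sketched in a few lines of Python. This is only an illustration of the arithmetic just described; the function name `suggest_interval_width` and its `target_groups` parameter are assumptions introduced here and are not part of the text.

```python
def suggest_interval_width(highest, lowest, target_groups=15):
    """Suggest a class-interval width that splits the range into about 15 groups.

    Hypothetical helper: divides the score range by the target number of
    groups and rounds to the nearest whole number (never less than 1).
    """
    score_range = highest - lowest          # 59 - 17 = 42 in the reading example
    raw_width = score_range / target_groups # 42 / 15 = 2.8
    return max(1, round(raw_width))


# Reading-test example from the text: highest score 59, lowest 17.
print(suggest_interval_width(59, 17))  # 3
```

The result is only a starting point. As the text notes, a convenient width such as 5 or 10 may be preferred when the computed value falls near one of them.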

It should be noted that sometimes there is no need to group
data into broader categories. If the original scores represent a
range of no more than, say, 20 points, grouping may not be called for.

In practice, when we are tabulating a set of data, deciding on the
size of the score interval is the first step. Next we set up the score
intervals, as shown in the left-hand column of Table 5.3. Each
individual is then represented by a tally mark, as shown in the
middle column. (It is easier to keep track of the tallies if every
fifth tally is a diagonal line across the preceding four.) The column
headed Frequency is gotten by counting the number of tallies in
each score interval.
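The tallying routine just described can be sketched as follows. The function name `grouped_frequencies` and the short list of scores are hypothetical, chosen only to illustrate the procedure; they are not the 52 reading scores of the text. The sketch assumes every score is at or above the lowest interval's lower score.

```python
from collections import Counter


def grouped_frequencies(scores, lowest_limit, width):
    """Count how many scores fall in each class interval.

    Intervals start at `lowest_limit` and each spans `width` score points,
    e.g. lowest_limit=16 and width=3 give 16-18, 19-21, and so on.
    Assumes no score is below `lowest_limit`.
    """
    # Index 0 is the lowest interval, 1 the next higher, etc.
    counts = Counter((s - lowest_limit) // width for s in scores)
    table = []
    for k in range(max(counts), -1, -1):  # highest interval first, as in Table 5.3
        low = lowest_limit + k * width
        table.append((low, low + width - 1, counts.get(k, 0)))
    return table


# Hypothetical scores for illustration only.
sample = [17, 20, 21, 22, 25, 25, 27]
for low, high, f in grouped_frequencies(sample, 16, 3):
    print(f"{low}-{high}  {'|' * f:<5} {f}")
```

Each printed row gives the interval, one tally mark per case, and the frequency, in the same three-column layout as Table 5.3.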


GRAPHIC REPRESENTATION 

It is often helpful to translate the facts of Table 5.3 into a
pictorial representation. A common type of graphic representation,
which is called a histogram, is shown in Fig. 5.1. This can be thought
of, somewhat grimly, as "piling up the bodies." The score intervals
are shown along the horizontal base-line (abscissa). The vertical
height of the pile (ordinate) represents the number of cases. The
diagram indicates that there are two "bodies" piled up in the
interval 16-18, three in the interval 19-21, and so forth. This figure
gives a clear picture of how the cases pile up, with most of them in
the 30's and a long low pile running up to the high scores.

Fig. 5.1. Histogram of reading scores. (Horizontal axis: score interval, 16-18 through 58-60; vertical axis: number of cases.)

Another way of picturing the same data is by preparing a
frequency polygon. This is shown in Fig. 5.2. Here we have plotted a
point at the mid-point of each of our score intervals. The height at
which we have plotted the point corresponds to the number of
cases, or frequency (f), in the interval. These points have been
connected, and the jagged line provides a somewhat different picture
of the same set of data illustrated in Fig. 5.1. Histogram and
frequency polygon are essentially interchangeable ways of showing the
same facts.

Fig. 5.2. Frequency polygon of reading scores. (Horizontal axis: score interval, 16-18 through 58-60; vertical axis: number of cases.)


MEASURES OF CENTRAL TENDENCY 


We often need a statistic to represent the typical, or average, or
middle score of a group of scores. A very simple way of identifying
the typical score is to pick out the score that occurs most frequently.
This is called the mode. If we examine the array of scores on p. 83,
we see that the score 36 occurs 4 times and is the mode for this set
of data. We can also note another fact. The score values 38, 37,
32, 31, and 27 each occur 3 times. If there were 1 less 36 and 1 more
27, for example, the mode would shift by 9 points. The mode is
sensitive to such minor changes in the data and is therefore a crude
and not very useful indicator of the typical score. In Table 5.3,
where we have the grouped frequency distribution, the modal
interval is the interval 34-36. This is as closely as we can identify the
mode for data presented in this way.


MEDIAN 

A much more useful way of representing the typical or average
score is to find the value on the score scale that separates the top
half of the group from the bottom half. This is called the median.
In our example, in which we have 52 cases, we want to separate the
top 26 from the bottom 26 pupils. The required value can be
estimated from the scores shown in Table 5.3. Starting with the lowest
score, we count up until we have the necessary 26 cases.


Table 5.4. Frequency Distribution and Cumulative Frequencies for
Reading Scores

 Score                      Cumulative
Interval     Frequency      Frequency
 58-60           1              52
 55-57           1              51
 52-54           1              50
 49-51           2              49
 46-48           2              47
 43-45           3              45
 40-42           3              42
 37-39           7              39
 34-36           8              32
 31-33           7              24
 28-30           5              17
 25-27           4              12
 22-24           3               8
 19-21           3               5
 16-18           2               2


Table 5.4 shows the cumulative frequencies as well as the frequency in
each interval. Each entry in the column labeled Cumulative Frequency
shows the total number having a score equal to or less than the
highest score in that interval. That is, there are 5 cases scoring at or
below 21, 8 scoring at or below 24, 12 scoring at or below 27, and so
forth. As indicated, we wish to identify the point below which 50
per cent of the cases fall. Since 50 per cent of 52 = 26, we must
identify the point below which 26 pupils fall.

We note that 24 individuals have scores of 33 or below. We
need to include 2 more cases to make up the required 26 cases. Note
that in the next score interval (34-36) there are 8 individuals. We
require only 2/8 or 1/4 of these individuals. Now how shall we think
of these cases being spread out over the score interval 34-36? As
we indicated on p. 85, a reasonable assumption is that they are
spread out evenly over the interval. To include 1/4 of the scores,
we would then have to go 1/4 of the way up from the bottom of the
interval toward the top.

At this point we must define what we mean by a score of 34. In
the first place, let us note that although test scores go by jumps of
1 unit, i.e., 34, 35, 36, we consider the underlying ability to have a
continuous distribution taking all intermediate values. Thus, we
do not get a score of 34.27, but this is only because our test does not
register that precisely. Our definition will be that a score of 34 means
closer to 34 than to either 33 or 35. That is, 34 will mean from
33½ to 34½. This definition is somewhat arbitrary but is rather
generally accepted in statistics textbooks. Our class interval 34-36
is really to be thought of as extending from 33½ to 36½. Since
we require 1/4 of the cases in this interval, we have 1/4 (36½ - 33½) =
1/4 × 3 = 3/4 = 0.75. We must add 0.75 to the value 33½, which
is the borderline between the 2 intervals. The median for this set
of scores is 33.5 + 0.75 = 34.25.


To compute the median, then,

1. Calculate the number of cases that represent 50 per cent of
the total group. In our example 50 per cent of 52 is 26.

2. Accumulate the scores up through each score interval. The
cumulative frequencies, as shown in Table 5.4, are 2, 5, 8, 12, 17,
etc.

3. Find the interval for which the cumulative frequency is just
less than the required number of cases. In our example the
cumulation through the 31-33 interval is 24.

4. Find the score distance to be added to the top of this interval,
in order to include the required number of cases, by the following
operation:

\[
\left(\frac{\text{Number of additional cases needed}}{\text{Number of cases in next interval}}\right)
\left(\text{Number of score points in interval}\right)
\]

In our example this becomes (2/8)(3) = 0.75.

5. Add this amount to the upper limit of the interval. We have
for our data 33.5 + 0.75 = 34.25. This score is the median, the
score below which 50 per cent of the cases fall.
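The five steps can be sketched in Python as follows. The frequencies are those of Table 5.3; the function name `grouped_percentile` and its `percent` parameter are assumptions introduced for illustration. Writing the routine for any percentage rather than only 50 per cent is a small generalization of the steps above.

```python
# Grouped reading scores from Table 5.3, lowest interval first:
# (lowest score in interval, highest score in interval, frequency)
INTERVALS = [
    (16, 18, 2), (19, 21, 3), (22, 24, 3), (25, 27, 4), (28, 30, 5),
    (31, 33, 7), (34, 36, 8), (37, 39, 7), (40, 42, 3), (43, 45, 3),
    (46, 48, 2), (49, 51, 2), (52, 54, 1), (55, 57, 1), (58, 60, 1),
]


def grouped_percentile(intervals, percent):
    """Score below which `percent` per cent of the cases fall (grouped data).

    Assumes cases are spread evenly within each interval and that an
    interval such as 34-36 really extends from 33.5 to 36.5.
    """
    n = sum(f for _, _, f in intervals)
    needed = n * percent / 100            # step 1: required number of cases
    cumulative = 0                        # step 2: running cumulative frequency
    for low, high, f in intervals:
        if f > 0 and cumulative + f >= needed:   # step 3: interval that is reached
            lower_limit = low - 0.5
            width = high - low + 1
            # steps 4 and 5: fraction of the interval needed, added to its lower limit
            return lower_limit + (needed - cumulative) / f * width
        cumulative += f
    raise ValueError("percent must lie between 0 and 100")


print(grouped_percentile(INTERVALS, 50))  # 34.25, the median found above
```

With `percent=50` the loop stops at the 34-36 interval after accumulating 24 cases, needs 2 of the 8 cases there, and returns 33.5 + (2/8)(3) = 34.25.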


PERCENTILES 
The same procedure may be used to find the score below which
any other percentage of the group falls. These values are all called
percentiles. The median is the 50th percentile, i.e., the score below
which 50 per cent of individuals fall. If we want to find the 25th
percentile, we must find the score below which 25 per cent of the
cases fall. Twenty-five per cent of 52 is 13. Thirteen cases take us
through the interval 25-27, and include 1 of the 5 cases in the 28-30
interval. So the 25th percentile is computed to be 27.5 + (1/5)3 =
27.5 + 0.6 = 28.1. Other percentiles can be found in the same
way. Percentiles have many uses, especially in connection with
test norms and the interpretation of scores.


ARITHMETIC MEAN 

Another frequently used statistic for representing the middle of
a group is the familiar "average" of everyday experience. Since
the statistician speaks of all measures of central tendency as
averages, he identifies this one as the arithmetic mean. This is just the
sum of a series of scores divided by the number of scores. Thus,
the average of 4, 6, and 7 is

\[
\frac{4 + 6 + 7}{3} = \frac{17}{3} = 5.67
\]

In our example, we can add together the scores of the 52 individuals
in our group. This gives us 1798. Dividing by 52, we get 34.58 for
the "average" or arithmetic mean for this group.


Table 5.5. Frequency Distribution of Reading Scores Showing Steps in
Calculating Arithmetic Mean and Standard Deviation

 Score      Frequency
Interval        f         x'       fx'      f(x')²
 58-60          1          8         8        64
 55-57          1          7         7        49
 52-54          1          6         6        36
 49-51          2          5        10        50
 46-48          2          4         8        32
 43-45          3          3         9        27
 40-42          3          2         6        12
 37-39          7          1         7         7
                                   +61
 34-36          8          0         0         0
 31-33          7         -1        -7         7
 28-30          5         -2       -10        20
 25-27          4         -3       -12        36
 22-24          3         -4       -12        48
 19-21          3         -5       -15        75
 16-18          2         -6       -12        72
                                   -68

 Sum           52                   -7       535

Adding together all the scores and dividing by the number of cases
is the straightforward way of computing the arithmetic mean.
However, it can be rather laborious, especially with a large group. More
efficient computing procedures are available, based on the frequency
distribution given in Table 5.3. These calculations are based on a
type of "trial balance." Picking a score interval that looks to be
about in the middle of the group, we sum the plus and minus
deviations from this starting place. An adjustment based on the excess
of plus or minus deviations and applied to this starting place gives
the value for the mean. The application of this procedure to the
reading test data is shown in Table 5.5 and the steps are outlined
below.


1. Choose some interval for the arbitrary starting place or
"origin." In this example the interval 34-36 has been chosen. Call
this interval zero. (Note: Any interval can be chosen, and the final
result will be the same. The particular interval chosen is purely a
matter of convenience.)

2. Call the next higher interval +1, the one above that +2,
etc.; call the next lower -1, the one below that -2, etc. These
are shown in the column labeled x'. This column indicates the
number of interval steps each interval is above or below our chosen
starting point.

3. For each row, multiply the number of cases (frequency) by the
number of steps (x') above or below the chosen origin. These
products give the values in the column headed fx'. Note the minus
signs in the lower half of the column. (Ignore the column headed
f(x')² for now. It refers to a later topic.)

4. Sum the values in the fx' column, taking account of the plus
and minus signs. (Mistakes will be avoided if the plus entries are
summed separately, the minus entries summed, and then the two
part sums combined to give the final total.)

5. Sum the frequencies in the column headed Frequency (or f)
to give the total number of cases in the group. This is usually
labeled N.

6. Divide the sum of the fx' values by N. Multiply by the
number of score points in each interval. Add the result to the score
corresponding to the mid-point of the zero interval. (Note that if
the sum in step 4 is negative, adding it becomes in effect
subtraction.)


These operations can be expressed by the following formula: *

\[
\text{Mean} = \left(\frac{\text{Sum of } fx'}{N}\right)(\text{Interval}) + \text{Arbitrary origin}
\]

In our illustration the values become

\[
\text{Mean} = \left(\frac{-7}{52}\right)(3) + 35 = (-0.135)(3) + 35 = -0.40 + 35 = 34.60
\]


Starting where we did, the minus deviations slightly overbalanced
the plus ones. There was an excess of 7 on the minus side. Our
starting point was a little too high. We had to shift it down 7/52
of 1 interval or 7/52 × 3 points of score to find a true balance point.
Since the middle of our zero interval corresponded to a score of 35,
we had to move down 21/52 points below 35 to get the true balance
point, the correct arithmetic mean.

The value 34.60 that we got in this way is almost the same as the 
34.58 that resulted from adding all the scores together and dividing 
by the number of cases. The correspondence is usually not perfect, 
due to slight inaccuracies involved in grouping our scores into classes 
in the frequency distribution, but the values obtained by the two 
methods will always agree closely. It makes no difference which 
interval we use for our starting point. Barring mistakes in arith- 
metic, we will always get identically the same result. 
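The "trial balance" computation can be sketched in Python. The frequencies come from Table 5.3; the function name `coded_mean` and the `origin_index` parameter (the position of the interval chosen as zero, counting from the lowest interval) are assumptions made for this illustration.

```python
# Grouped reading scores from Table 5.3, lowest interval first:
# (lowest score in interval, highest score in interval, frequency)
INTERVALS = [
    (16, 18, 2), (19, 21, 3), (22, 24, 3), (25, 27, 4), (28, 30, 5),
    (31, 33, 7), (34, 36, 8), (37, 39, 7), (40, 42, 3), (43, 45, 3),
    (46, 48, 2), (49, 51, 2), (52, 54, 1), (55, 57, 1), (58, 60, 1),
]


def coded_mean(intervals, origin_index):
    """Arithmetic mean from a grouped distribution using an arbitrary origin."""
    low0, high0, _ = intervals[origin_index]
    width = high0 - low0 + 1                 # score points per interval
    origin_midpoint = (low0 + high0) / 2     # score at the middle of the zero interval
    n = sum(f for _, _, f in intervals)      # N, the total number of cases
    # x' is the number of interval steps above (+) or below (-) the origin.
    sum_fx = sum(f * (k - origin_index) for k, (_, _, f) in enumerate(intervals))
    return origin_midpoint + width * sum_fx / n


# Interval 34-36 is at index 6; the sum of fx' is then -7, as in Table 5.5.
print(round(coded_mean(INTERVALS, 6), 2))  # 34.6
```

Calling the function with a different `origin_index` changes the intermediate sum of fx' but not the final mean, which is the point made in the paragraph above.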

The arithmetic mean and the median do not correspond exactly
but usually they will not differ greatly. In this example, the values
are 34.60 and 34.25, respectively. The mean and median will differ
substantially only when the set of scores is very "skewed," i.e., there
is a piling up of scores at one end and a long tail at the other. Fig.
5.3 shows three distributions differing in amount and direction of
skewness. The top figure is positively skewed, i.e., has a tail
running up into the high scores. We get a distribution like this for
income in the United States, since there are many people with small
and moderate incomes and only a few with very large incomes. The
center figure is negatively skewed. A distribution like this would
result if a class was given a very easy test, which resulted in a piling
up of perfect and near-perfect scores. The bottom figure is
symmetrical and is not skewed in either direction. Many physical and
psychological variables give such a symmetrical distribution. In
the many distributions that are approximately symmetrical either
mean or median will serve equally well to represent the average of
the group, but with skewed distributions the median generally seems
preferable. It is less affected by a few cases out in the long tail.

Fig. 5.3. Frequency distributions differing in skewness. (Panels, top to bottom: positively skewed, negatively skewed, symmetrical; the arithmetic mean and median are marked.)

* A list of common statistical symbols and their meanings is given at the end of
the chapter. Reference to these definitions may help in reading the remainder of
the chapter.

MEASURES OF VARIABILITY 
When describing a set of scores, it is often significant to report
how variable the scores are, how much they spread out from high to
low scores. For example, two groups of children, both with a
median age of 10 years, would represent quite different educational
situations if one had a spread of ages from 9 to 11 while the other
ranged from 6 to 14. A measure of this spread is an important
statistic for describing a group.

A very simple measure of variability is the range of scores in the
group. This is simply the difference between the highest and the
lowest score. In our reading test example it is 59 - 17 = 42.
However, the range depends only upon the 2 extreme cases in the total
group. This makes it very undependable, since it can be changed
a good bit by the addition or omission of a single extreme case.


SEMI-INTERQUARTILE RANGE 

A better measure of variability is the range of scores that includes
a specified part of the total group, usually the middle 50 per cent.
The middle 50 per cent of the cases in a group are the cases lying
between the 25th and 75th percentiles. We can compute these two
percentiles, following the procedures outlined on pp. 88-89. For our
example, the 25th percentile was computed to be 28.1. If we
calculate the 75th percentile, we will find that it is 39.5. The distance
between them is 11.4 points of score.

The 25th and 75th percentiles are called quartiles, since they cut
off the bottom quarter and the top quarter of the group respectively.
The score distance between them is called the interquartile range.
A statistic that is often reported as a measure of variability is the
semi-interquartile range (Q). This is half of the interquartile range.
It is the average distance from the median to the 2 quartiles, i.e.,
it tells how far the quartile points lie from the median, on the
average. In our example, the semi-interquartile range is

\[
Q = \frac{39.5 - 28.1}{2} = \frac{11.4}{2} = 5.7
\]

If the scores spread out twice as far, Q would be twice as great; if
they spread out only half as far, Q would be half as large. Two
distributions that have the same mean, same total number of cases
and same general form, and that differ only in that one has a
variability twice as large as the other are shown in Fig. 5.4.

Fig. 5.4. Two distributions differing only in variability. (Panels: large variability, small variability.)


STANDARD DEVIATION 

The semi-interquartile range belongs to the same family of
statistics as the median. Its computation is based upon percentiles. There
are also measures of variability that belong to the family of the
arithmetic mean and are based upon score deviations. Suppose we
had 4 scores which were 4, 5, 6, and 7 respectively. Adding these
together and dividing by the number of scores we get

\[
\frac{4 + 5 + 6 + 7}{4} = \frac{22}{4} = 5.5
\]

This gives us the arithmetic mean. But now we ask how widely
these scores spread out around that mean value. Suppose we find
the difference between each score and the mean, i.e., we subtract
5.5 from each score. We then have -1.5, -0.5, 0.5, and 1.5. These
represent deviations of the scores from the mean. The bigger the
deviations, the more variable the set of scores. What we require is
some type of average of these deviations to give us an over-all
measure of variability.

If we simply sum the above 4 deviation values, we find that they
add up to zero. This is necessarily so. We defined our arithmetic
mean as the point around which the plus and minus deviations
exactly balance. We shall have to do something else. The
procedure that statisticians have devised for handling the plus and the
minus signs is to square all the deviations. (A minus times a minus
is a plus.) An average of these squared deviations is obtained by
summing them and dividing by the number of cases. To
compensate for squaring the individual deviations, the square root of this
average value is computed. The resulting statistic is called the
standard deviation (S.D. or s). It is the square root * of the average
of the squared deviations from the mean. For our little example
of 4 cases, the calculations are shown below.

* The steps for computing the square root are shown in Appendix I.


\[
\text{S.D.} = \sqrt{\frac{(-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2}{4}}
= \sqrt{\frac{2.25 + 0.25 + 0.25 + 2.25}{4}}
= \sqrt{\frac{5}{4}} = \sqrt{1.25} = 1.12
\]
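The same calculation can be sketched in Python. The function name `standard_deviation` is an assumption introduced here; following the text, it divides by the number of cases N (not N - 1).

```python
import math


def standard_deviation(scores):
    """Square root of the average squared deviation from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    # Squaring removes the minus signs, so the deviations no longer cancel.
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_deviations) / n)


print(round(standard_deviation([4, 5, 6, 7]), 2))  # 1.12
```

Python's standard library offers `statistics.pstdev`, which also divides by N and should give the same value for these four scores.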


STANDARD DEVIATION COMPUTED FROM FREQUENCY DISTRIBUTION 


The standard deviation may also be computed from the grouped
frequency distribution. The necessary steps have been carried out
in Table 5.5. Take special note of the column headed f(x')². Each
entry in this column represents the number of cases (f) multiplied
by the square of the deviation (x') of that score interval from the
arbitrary origin. The sum of the values in this column gives a sum
of squared deviations, but these deviations are around our arbitrary
origin and are expressed in interval units. Several adjustments are
necessary to express the deviations in score units and in terms of
the true arithmetic mean. The steps are outlined below, each with
its expression in symbols and its value in the illustrative example.

1. Carry out the operations for computing the arithmetic mean, as described
on pp. 91-92.

2. In addition, prepare the column headed f(x')². Each entry in this column
is the frequency (f) times the square of the deviation value (x'). However,
this last column can be computed most simply by multiplying together the
entries in the two preceding columns, i.e., x' times fx'. Note that all the
signs in this column are positive, since a minus times a minus gives a plus.

3. Get the sum of the f(x')² column. ("The sum of" will be indicated by Σ.)
   In symbols: $\sum f(x')^2$. In our example: 535.

4. Divide this sum by the number of cases.
   In symbols: $\frac{\sum f(x')^2}{N}$. In our example: $\frac{535}{52} = 10.288$.

5. Divide the sum of the fx' column by the number of cases.
   In symbols: $\frac{\sum fx'}{N}$. In our example: $\frac{-7}{52} = -0.135$.

6. Square the value obtained in 5 above.
   In symbols: $\left(\frac{\sum fx'}{N}\right)^2$. In our example: $\left(\frac{-7}{52}\right)^2 = 0.018$.

7. Subtract the value in 6 from that in 4.
   In symbols: $\frac{\sum f(x')^2}{N} - \left(\frac{\sum fx'}{N}\right)^2$. In our example: 10.288 - 0.018 = 10.270.

8. Take the square root of the value in 7.
   In symbols: $\sqrt{\frac{\sum f(x')^2}{N} - \left(\frac{\sum fx'}{N}\right)^2}$. In our example: $\sqrt{10.270} = 3.20$.

9. Multiply by the number of score points in each class interval. (We call
this width of interval i.)
   In symbols: $i\sqrt{\frac{\sum f(x')^2}{N} - \left(\frac{\sum fx'}{N}\right)^2}$. In our example: 3(3.20) = 9.60.

Presenting all the computations for our example in summary form,
using the formula given in step 9 above, we have

\[
\text{S.D.} = i\sqrt{\frac{\sum f(x')^2}{N} - \left(\frac{\sum fx'}{N}\right)^2}
= 3\sqrt{\frac{535}{52} - \left(\frac{-7}{52}\right)^2} = 9.60
\]
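Steps 3 through 9 can be sketched in Python. The frequencies are those of Table 5.3; the function name `coded_standard_deviation` and the `origin_index` parameter are assumptions introduced for illustration.

```python
import math

# Grouped reading scores from Table 5.3, lowest interval first:
# (lowest score in interval, highest score in interval, frequency)
INTERVALS = [
    (16, 18, 2), (19, 21, 3), (22, 24, 3), (25, 27, 4), (28, 30, 5),
    (31, 33, 7), (34, 36, 8), (37, 39, 7), (40, 42, 3), (43, 45, 3),
    (46, 48, 2), (49, 51, 2), (52, 54, 1), (55, 57, 1), (58, 60, 1),
]


def coded_standard_deviation(intervals, origin_index):
    """Standard deviation from a grouped distribution using an arbitrary origin."""
    low0, high0, _ = intervals[origin_index]
    width = high0 - low0 + 1                       # i, the width of the interval
    n = sum(f for _, _, f in intervals)            # N
    steps = [(k - origin_index, f) for k, (_, _, f) in enumerate(intervals)]
    sum_fx = sum(f * x for x, f in steps)          # sum of fx'     (-7 for origin 34-36)
    sum_fx2 = sum(f * x * x for x, f in steps)     # sum of f(x')^2 (535 for origin 34-36)
    # Steps 4-8: mean square about the origin, corrected to the true mean.
    in_interval_units = math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)
    return width * in_interval_units               # step 9: back to score units


print(round(coded_standard_deviation(INTERVALS, 6), 2))  # 9.61
```

The text's figure of 9.60 comes from rounding the square root to 3.20 before multiplying by 3; carrying full precision gives about 9.61. The two agree to the accuracy that grouped data warrant.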


INTERPRETING THE STANDARD DEVIATION 

It is almost impossible to say in any simple terms what the
standard deviation is or what it corresponds to in pictorial or geometric
terms. Primarily, it is a statistic that characterizes a distribution
of scores. It increases in direct proportion as the scores spread out
more widely. The larger the standard deviation, the wider the spread
of scores. A student sometimes asks: But what is a small standard
deviation? What is a large one? There is really no answer to this
question. Suppose that for some group the standard deviation of
weights is 10. Is this large or small? It depends on whether we are
talking about ounces, or pounds, or kilograms. It depends upon
whether we are dealing with the weights of mice, or men, or
mammoths. Large and small have only relative meaning, i.e., larger or
smaller than that found for some other group or with some other test.

The standard deviation gets its most clear-cut meaning for one
particular type of distribution of scores. This distribution is called
the "normal" distribution. It is defined by a particular
mathematical equation, but to the everyday user it is defined
approximately by its pictorial qualities. The "normal" curve is a
symmetrical curve having a bell-like shape. That is, most of the scores
pile up in the middle score values; as one goes away from the middle
in either direction the pile drops off, first slowly and then more
rapidly, and the cases tail out to relatively long tails on either end.
An illustration of a typical normal curve is shown in Fig. 5.5. This
curve is the normal curve that best fits the reading test data we
have been using as an illustration. It has the same mean, standard
deviation, and total area (number of cases) as the reading test data.
The histogram of reading test scores appears in light dotted lines,
so one can see how closely the curve fits the actual test scores.

Fig. 5.5. Example of a normal curve (fitted to reading test data). (Horizontal axis marked at -2 S.D., -1 S.D., Mean, +1 S.D., +2 S.D.)

For the normal curve, there is an exact mathematical relationship
between the standard deviation and the proportion of cases. The
same proportion of cases will always be found within the same
standard deviation limits. This relationship is shown in Table 5.6.


Table 5.6. Proportion of Cases Falling within Certain Specified
Standard Deviation Limits for a Normal Distribution

                                                        Per Cent
Limits within Which Cases Lie                           of Cases

Between the mean and either +1.0 S.D. or -1.0 S.D.        34.1
Between the mean and either +2.0 S.D. or -2.0 S.D.        47.7
Between the mean and either +3.0 S.D. or -3.0 S.D.        49.9
Between +1.0 and -1.0 S.D.                                68.2
Between +2.0 and -2.0 S.D.                                95.4
Between +3.0 and -3.0 S.D.                                99.8


Thus, in any normal curve about two-thirds (68.2 per cent) of the
cases will fall between +1 and -1 standard deviation from the
mean. Approximately 95 per cent will fall between +2 and -2
standard deviations from the mean, and very nearly all the cases will
fall between +3 and -3 standard deviations from the mean. An
individual who gets a score 1 standard deviation above the mean will
surpass 84 per cent of the group, i.e., he will surpass the 50 per cent
who fall below the mean and the 34 per cent who fall between the
mean and +1 standard deviation.
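The proportions in Table 5.6 can be checked with a short Python sketch. It relies on the standard result that, for a normal curve, the proportion of cases within k standard deviations of the mean equals erf(k/√2); the function name `percent_within` is an assumption introduced here.

```python
import math


def percent_within(k):
    """Per cent of a normal distribution lying within +/- k S.D. of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(percent_within(k), 1))
```

The computed values (about 68.3, 95.4, and 99.7 per cent) differ from the table's 68.2 and 99.8 only in the last digit, because the table doubles the already rounded one-sided figures of 34.1 and 49.9.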

This unvarying relationship of the standard deviation unit to the
arrangement of scores in the normal distribution gives the standard
deviation a type of standard meaning. It becomes a yardstick in
terms of which different groups may be compared or the status of a
given individual may be evaluated. Although the relationship of
the standard deviation unit to the score distribution does not hold
exactly in distributions other than the normal distribution, frequently
the distribution of test or other scores approaches the normal curve
closely enough so that the standard deviation continues to have
very nearly the same meaning.

The meaning of being a given number of standard deviations
above or below the mean may be expressed in terms of the per cent
of cases in the group whom the individual surpasses. A number
of values for this relationship are given in Table 5.7.


Table 5.7. Per Cent of Group Falling below Selected Standard Deviation
Values for Normal Curve

   Standard             Per Cent
   Deviation          Having Scores
     Value           below This Value
     +3.0                 99.9
     +2.5                 99.4
     +2.0                 97.7
     +1.5                 93.3
     +1.0                 84.1
     +0.5                 69.1
      0.0                 50.0
     -0.5                 30.9
     -1.0                 15.9
     -1.5                  6.7
     -2.0                  2.3
     -2.5                  0.6
     -3.0                  0.1


This table provides a basis for interpreting any particular score.
Consider the set of reading test scores for which we computed the mean
and standard deviation to be 34.6 and 9.6 respectively. Suppose a
person had a score of 45. Since the mean of the group is 34.6, he falls
45 - 34.6 = 10.4 points above the mean of the group. The 10.4 points
by which he surpasses the mean is equal to 10.4/9.6 = 1.08 standard
deviations. He is 1.08 standard deviations above the mean. We
might expect him to surpass approximately 85 per cent of the cases
in our group. (An actual count shows that this score is better than
45/52 = 86.5% of the scores in our set of data.) A score expressed
in standard deviation units has much the same meaning from one
set of scores to another, and these units are directly comparable
from one measure to another.
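The worked example can be sketched in Python. The function names `standard_score` and `percent_below` are assumptions introduced here; `percent_below` uses the normal-curve relationship that underlies Table 5.7 rather than a table lookup.

```python
import math


def standard_score(score, mean, sd):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd


def percent_below(z):
    """Per cent of a normal distribution falling below a standard score z."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Reading-test example: score 45, mean 34.6, standard deviation 9.6.
z = standard_score(45, 34.6, 9.6)
print(round(z, 2))             # 1.08
print(round(percent_below(z)))  # expected per cent surpassed under a normal curve
```

The normal-curve figure is close to both the text's estimate of approximately 85 per cent and the actual count of 86.5 per cent, which suggests that the reading scores follow the normal curve fairly well in this region.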
In summary, the statistics most used for describing the variability
of a set of scores are the semi-interquartile range and the standard
deviation. The semi-interquartile range is based upon percentiles,
i.e., the 25th and 75th percentiles, and is commonly used when the
median is being used as a measure of the middle of the group. The
standard deviation is a measure of variability that goes with the
arithmetic mean. It is useful in the field of tests and measurements
primarily as providing a standard unit of measure having comparable
meaning from one test to another.
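
As a rough sketch of how the two measures of variability pair up with
the median and the mean, the following Python fragment computes both
for a small set of scores. The scores are hypothetical, and two choices
are assumptions of ours rather than the book's procedure: the quartiles
use the "inclusive" interpolation rule of Python's statistics module,
and the standard deviation divides by the number of cases.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [22, 27, 30, 31, 34, 35, 38, 41, 45]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Semi-interquartile range: half the distance between the 25th and
# 75th percentiles. It goes with the median.
semi_interquartile_range = (q3 - q1) / 2

# Standard deviation (dividing by the number of cases). It goes with the mean.
mean = statistics.mean(scores)
sd = statistics.pstdev(scores)

print(f"median = {median}, Q = {semi_interquartile_range}")
print(f"mean = {mean:.1f}, S.D. = {sd:.1f}")
```

Other percentile conventions, such as interpolation within class
intervals, can give slightly different quartiles for the same scores.
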


INTERPRETING THE SCORE OF AN INDIVIDUAL 


The problems of interpreting the score for an individual will be 
treated more fully in Chapter 7, when we turn to test norms and units 
of measure. It will suffice now to indicate that the two sorts of 
measures we have just been considering, i.e., percentiles and stand- 
ard deviation units, each give us a framework in which we can view 
the performance of a specific person. Thus, referring to the example 
we worked out, if a new boy in the class got a score of 45 on the 
reading test we could say either 


a. That he surpassed 86 per cent of the group, i.e., that he fell 
at the 86th percentile, or 
b. That he fell 1.08 standard deviations above the mean. 


Either statement gives his score meaning in relation to his group: 
he is clearly above average but not one of the few best ones in the 
group. Since they are based on the same score, they are two ways 
of saying the same thing. Each has certain advantages, which we 
will examine more carefully in Chapter 7. 
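
The two statements can be produced side by side for any score. The
sketch below uses a small hypothetical group (not the class data from
the text), and the helper names are our own. It defines the percentile
rank simply as the per cent of the group scoring below the given score.

```python
import math


def percentile_rank(score, group):
    """Per cent of the group whose scores fall below the given score."""
    below = sum(1 for s in group if s < score)
    return 100.0 * below / len(group)


def deviations_from_mean(score, group):
    """Standard deviations above (+) or below (-) the group mean."""
    n = len(group)
    mean = sum(group) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in group) / n)
    return (score - mean) / sd


# Hypothetical reading scores for a group of ten pupils.
group = [20, 25, 28, 30, 33, 35, 38, 40, 44, 50]
new_score = 45

print(f"surpasses {percentile_rank(new_score, group):.0f} per cent of the group")
print(f"{deviations_from_mean(new_score, group):+.2f} standard deviations from the mean")
```
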


MEASURES OF RELATIONSHIP 


We look now for a statistic to express the relationship between
two sets of scores. Thus, in our illustration we have a reading score 
and an arithmetic score for each pupil. To what extent did those 
pupils who did well in arithmetic also do well on the reading test? 
In this case, we have two scores for each individual. We can picture
these scores by a plot in two dimensions. This is shown in Fig. 5.6.
The first person in our group, Carol A, had a score on the reading
test of 32 and a score on the arithmetic test of 3. Her scores are
represented by the X in Fig. 5.6, plotted at 32 on the vertical or
reading scale and at 3 on the horizontal or arithmetic scale. There
is a dot to represent each other child's scores.


Fig. 5.6. Plot of reading versus arithmetic scores. (Vertical axis:
reading score. Horizontal axis: arithmetic score.)


If a child who does well in reading also does well in arithmetic,
we will find his score in the upper right hand part of our picture.
A child who does poorly on both tests will fall at the lower left.
Where good score on one test goes with poor score on the other, we
will find the points falling in the other corners, i.e., upper left
and lower right. Inspection of Fig. 5.6 will show some tendency for
the scores to splatter out in the lower-left to upper-right
direction, i.e., from low-low to high-high. But there are many
exceptions. The relationship is far from perfect. It is a matter of
degree. We need some type of statistical index to express this
degree of relationship.

As an index of this degree of relationship, a statistic known as
the correlation coefficient can be computed. (The symbol r is used
to designate this coefficient.) This coefficient can take values
ranging from +1 through zero to −1. A correlation of +1 signifies
that the person who had the highest score on one test also had the
highest score on the other, the next highest on one was the next
highest on the other, and so on, exactly in parallel through the
whole group. A correlation of −1 means that the scores go in exactly
the reverse direction, i.e., the person highest on one is lowest on
the other, next highest on one is next lowest on the other, etc. A
zero correlation represents a complete lack of relationship.
In-between values of r represent tendencies for relationship to
exist but with many discrepancies.


Figure 5.7 illustrates four different levels of relationship. In box
A the correlation is zero, and the points scatter out in a pattern
that is just about round. All combinations are found: high-high,
low-low, high-low, and low-high. Box B corresponds to a correlation
of +0.30. You can see a barely perceptible trend for the points to
group in the low-low and high-high direction. The tendency is more
marked in box C, which represents a correlation of +0.60. In box D,
which portrays a correlation of +0.90, the trend is much more
marked. But even with as high a correlation as this, the scores
spread out quite a bit and do not follow an exact line from low-low
to high-high. We may note in passing that the scores plotted in
Fig. 5.6 correspond to a correlation coefficient of +0.46. Procedures
for computing the correlation coefficient are outlined in Appendix
II for those readers who wish to carry out the calculations with a
numerical example.


Fig. 5.7. Distribution of scores for representative values of
correlation coefficient. (Box A: correlation of 0.00. Box B:
correlation of 0.30. Box C: correlation of 0.60. Box D: correlation
of 0.90. Both axes of each box run from Low to High.)
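
For readers who prefer to see the calculation in code, the sketch
below computes the product-moment correlation coefficient from
deviations about the two means. It is not the worksheet of Appendix
II. The function name and the small sets of scores are hypothetical,
chosen so that the limiting cases of +1 and −1 are easy to verify.

```python
import math


def correlation(xs, ys):
    """Product-moment correlation coefficient r for paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired scores for five pupils.
reading = [10, 20, 30, 40, 50]

print(correlation(reading, [1, 2, 3, 4, 5]))  # same order throughout: r = +1
print(correlation(reading, [5, 4, 3, 2, 1]))  # exactly reversed order: r = -1
print(correlation(reading, [2, 1, 4, 3, 5]))  # a trend with discrepancies
```
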

There are two important settings in which correlation coefficients
will be encountered in connection with tests and measurements. The
first situation is one in which we are trying to determine how
precise and consistent a measurement procedure is. Thus, if we wanted
to know how consistent a measure of speed we get from a 50-yard
dash, we could have each child run the distance twice, perhaps on
successive days. The correlation of his two scores would give us
information on the precision or reliability of this measure of running
speed. The second situation is one in which we are studying the
relationship between two different measures, often in order to
evaluate one as a predictor of the other. Thus, we might want to study
a scholastic aptitude test as a predictor of college grades. The
correlation of test with grades would give an indication of the test's
usefulness as a predictor.

We face the problem, in each case, of evaluating the correlation
we obtain. Suppose the two sets of 50-yard dash scores yield a
correlation of 0.80. Is this satisfactory or not? Suppose the aptitude
test correlates 0.60 with college grades. Shall we be pleased or
discouraged?

The answer lies in part in the plots of Fig. 5.7. Clearly, the higher
the correlation, the more closely one variable goes with the other.
If we think of discrepancies away from the diagonal line from
low-low to high-high as "errors," the errors become smaller as the
correlation becomes larger. Furthermore, these discrepancies are
still discouragingly large for even a rather substantial correlation
coefficient, i.e., box C in Fig. 5.7. We must always be aware of these
discrepancies and realize that with a correlation of 0.60, for
example, between aptitude test and school grades, there will be a
number of children whose performance differs a good deal from what
we have predicted.
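
One way to attach numbers to these "errors" goes slightly beyond the
text at this point, so it should be read as a supplementary sketch
rather than the authors' own procedure. When one measure is predicted
from the other by a straight line, the spread of the errors of
prediction, expressed as a fraction of the standard deviation of the
measure being predicted, is the square root of (1 − r²). The function
name below is our own.

```python
import math


def relative_error_spread(r):
    """Spread of prediction errors as a fraction of the criterion's
    standard deviation, assuming straight-line prediction."""
    return math.sqrt(1.0 - r * r)


# The four correlations pictured in Fig. 5.7.
for r in (0.00, 0.30, 0.60, 0.90):
    print(f"r = {r:.2f}: errors spread over {relative_error_spread(r):.2f} S.D.")
```

On this reckoning a correlation of 0.60 still leaves errors about
four-fifths as widely spread as they would be with no predictor at
all, which fits the caution expressed above.
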

However, everything is relative, and any given correlation
coefficient must be interpreted in comparison to values that are
commonly obtained. Table 5.8 contains a number of different
correlations that have been reported for different types of variables.
The nature of the scores being correlated is described and the
coefficient reported. An examination of this table will provide some
initial background for interpreting correlation coefficients. The
coefficient will gradually take on added meaning as the reader
encounters coefficients of different sizes in his reading about and
work with tests.


Table 5.8. Correlation Coefficients for Selected Variables

                                                    Correlation
    Variable                                        Coefficient
    Height of identical twins                           .95
    Intelligence of identical twins                     .88
    Height versus weight                                .60
    Intelligence of siblings                            .53
    Height of siblings                                  .50
    Strength of grip and speed of running               .16
    Height versus Binet I.Q.                            .06
    Height versus educational achievement               .01
    Shape of head versus intelligence                   .01
    Height versus sociability                           .00
    No. of physical defects among boys versus
      school progress                                  −.29


SUMMARY STATEMENT 


We opened this chapter by pointing out the various kinds of 
questions we might wish to answer by referring to a set of test scores. 
Let us look at these questions again and see what answers we have 
offered for them. 


1. How do our scores "run"; what do they "look like"? To answer
this question, we can arrange our scores into a frequency
distribution (Table 5.4) or plot them in a histogram (Fig. 5.1).

2. What score is typical of the group; represents the middle of the
group? To represent the middle of the group we may calculate the
median, the 50th percentile (p. 88), or the arithmetic mean, the
common average (pp. 90-91).

3. How widely spread out are the scores; how much do they scatter? 
To represent the spread of scores statisticians have developed (1) 
the semi-interquartile range, half the distance between the 25th and 
75th percentile (p. 94), and (2) the standard deviation (pp. 95-97), 
a type of average of the deviations of the scores away from the 
average.

4. How are we to determine what the score of an individual means,
whether it is high or low? Though this problem is left for fuller
discussion in Chapter 7, we have seen that the individual score takes
on meaning as it is translated into a percentile rank, the per cent of
the group he beat, or into a standard score, how many standard
deviations above or below the mean he fell (p. 100).

5. To what extent do two sets of scores go together; to what extent are
the same individuals high or low on both? A measure of relationship
is given by the correlation coefficient, a numerical index of
"going-togetherness" (pp. 101-103). This index is important as
describing the precision or reliability of a test and as describing
the accuracy with which a test score predicts some other factor, such
as school grades or job success.


STATISTICAL SYMBOLS 


The student who reads test manuals, books dealing with tests, or
articles about testing in the educational journals will encounter a
number of conventional symbols to refer to statistical concepts or
operations. Some of the commonest are defined below. This table of
definitions should help in reading later chapters of this book, as
well as outside references.


    Symbol        Definition

    N             The total number of cases in the group.
    f             Frequency. The number of cases with a specific score
                  or in a particular class interval.
    X             A raw score on some measure.
    x             A deviation score, indicating how far the individual
                  falls above or below the mean of the group.
    x′            A deviation score from some arbitrary reference
                  point, often expressed in interval units.
    i             The number of points of score in one class interval.
    X̄ or M        The mean of the group.
    Md            The median of the group.
    Q₁            The lower quartile, the 25th percentile.
    Q₃            The upper quartile, the 75th percentile.
    Q             The semi-interquartile range. Half the difference
                  between Q₃ and Q₁.
    P             A percentile.
    A subscript   Modifies a symbol and tells which specific individual
                  or value is referred to, e.g., P₁₀ is the 10th
                  percentile, Xⱼ is the raw score of person j.
    S.D. or s     Standard deviation of a set of scores.
    σ             Standard deviation in the population, though
                  sometimes used to refer to the particular sample.
    p             Per cent of persons getting a test item correct.
    q             Per cent of persons getting a test item wrong
                  (p + q = 100).
    r             A coefficient of correlation.
    r₁₁           A reliability coefficient. The correlation between
                  two equivalent test forms or two administrations of
                  a test.
    Σ             "Take the sum of."


SUGGESTED ADDITIONAL READING 


Dixon, Wilfred J., and Frank J. Massey, Jr., Introduction to statistical analy- 
sis, New York, McGraw-Hill, 1951. 


Lindquist, Everet F., A first course in statistics, rev. ed., New York, Hough-
ton Mifflin, 1942. 


Walker, Helen M., Elementary statistical methods, New York, Holt, 1943. 


QUESTIONS FOR DISCUSSION 


1. For each of the sets of scores indicated below, select what appears to
you to be the most suitable class interval, and set up a form for tallying the
scores:


    Test                       No. of Cases    Range of Scores
    Arithmetic                      84            8 to 53
    Reading Comprehension           57           15 to 75
    Interest Inventory             563           68 to 224


2. In each of the following distributions, indicate (a) the size of the
class interval, (b) the mid-point of the intervals shown, and (c) the
real limits of the intervals (i.e., the dividing lines between them).


    (1)   4-7        (2)  17-19       (3)  50-59
          8-11            20-22            60-69
         12-15            23-25            70-79


3. Using the spelling scores given in Table 5.1 on p. 80, make a
frequency distribution and a histogram. Compute the median and the
upper and lower quartiles. Compute the arithmetic mean and standard
deviation.

4. In the Bureau of Census reports the median is used in reporting
averages. Why is it used, rather than the arithmetic mean?

5. A ...-item vocabulary test given to 150 pupils yielded scores ranging
... Ninety-seven fell between 40 and 50. What would this distribution
look like? What could you say about the suitability of the test? What
measure of central tendency would be most suitable? What measure of
variability would you probably use?


6. A high-school teacher gave two sections of a history class the same test. 
Results were as follows: 


                            Section A    Section B
    Median                    64.6         64.3
    Mean                      65.0         63.2
    75th percentile           69.0         70.0
    25th percentile           61.0         54.0
    Standard deviation         6.0         10.5


From these data, what can you say about the two classes? What
implications do the data have for teaching the two groups?

7. A test in social studies, given to 2500 tenth- and eleventh-grade stu- 
dents, had a mean of 52 and a standard deviation of 10.5. How many stand- 
ard deviations above or below the mean would the following pupils fall? 


Alice 48 Henry 60 John 31 
Willard 56 Jane 36 г 84 


8. If the distribution in the previous example was approximately normal, 
about what per cent of the group would each of these pupils surpass? 
9. Explain the meaning of each of the following correlation coefficients: 


a. The correlation between scores on a reading test and on a group intelli- 
gence test is +0.78. 

b. Ratings of pupils on "good citizenship" and on "aggressiveness" show a
correlation of −0.56.

c. The correlation between height and score on an achievement test is 0.02.


Chapter 6 


Qualities Desired in Any 
Measurement Procedure 


Whenever a worker in psychology or education desires to measure 
some quality in a group or individual, he faces the problem of choos- 
ing the best instrument for his purpose. Ordinarily there will be 
several tests or testing procedures that have been developed for, or 
that seem to be at least possibilities for, his purpose. He must 
choose among these. He is also probably interested in determining 
not only which is the best procedure but how well it satisfies his 
needs by some absolute standard. On what grounds can he make 
his choice or his appraisal? 

There are many specific considerations entering into the evaluation
of a test, but we shall consider them here under three main
headings. These are respectively validity, reliability, and
practicality. Validity refers to the extent to which the test measures
what we actually wish to measure. Reliability has to do with accuracy
and precision of a measurement procedure. Indices of reliability
give an indication of the extent to which a particular measurement
is consistent and reproducible. Practicality is concerned with a wide
range of factors of economy, convenience, and interpretability that
determine whether a test is practical for widespread use. These
three aspects of test evaluation will be considered in detail in the
following sections.


VALIDITY 


The first and foremost question to be asked with respect to any
testing procedure is: How valid is it? When we ask this question,
we are inquiring whether the test measures what we want it to
measure, all of what we want it to measure, and nothing but what we
want it to measure.

When we apply a steel tape measure to the top of our desk to
determine its length, we have no doubt that the tape does in fact
measure the length of the desk and does directly serve our purpose,
which may be to determine whether the desk will fit between two
windows in our room. Long experience with this type of measuring
instrument has confirmed beyond a shadow of doubt its validity as a
measuring tool for certain purposes.

Suppose now that we give to a group of children a test of reading
achievement. This test requires the children to select certain
answers to a series of questions about reading passages and to make
little pencil marks on an answer sheet. We count the number of
pencil marks that were made in the predetermined right places and
give the child a score, which is the number of his right answers. We
call this score his reading comprehension. But the score itself is
not the comprehension. It is the record of a sample of behavior.
Any judgment regarding comprehension is an inference based on
the evidence provided by the number of allegedly correct answers.
Its validity is not self-evident but is something we must establish
on the basis of adequate evidence.

Consider again the typical personality inventory that endeavors
to provide an appraisal of "emotional adjustment." In this type of
inventory the respondent marks a series of statements as being
characteristic of him or not characteristic of him. On the basis of
various types of procedures, which we shall consider in some detail
in Chapter 14, certain responses are keyed as indicative of emotional
maladjustment. A score is obtained by seeing how many of these
responses an individual selects. But making certain marks on a
piece of paper is a number of steps removed from actually exhibiting
emotional disturbance. We must find some way of establishing the
extent to which the performance on this test actually corresponds to
the quality of behavior in which we are directly interested. How
can we determine the validity of a measurement procedure?


TYPES OF EVIDENCE OF VALIDITY


There are two main types of evidence bearing on the validity of
a test, rational and empirical. On the one hand, we encounter a
wide range of testing situations in which appraisal of the validity
of a measurement procedure depends primarily upon rational analysis
and professional judgment. The analysis may be of the topics and
areas included in the test, i.e., its content. For this type of
analysis we shall speak of content validity. The rational analysis may
be of the activities and processes that correspond to a particular
concept (such as "scientific method"), and we may then speak of
concept or construct validity.


The second main type of evidence of validity is empirical and
statistical. This type of evidence comes from the relationship of the
instrument that we are studying to some other measure or fact. 
This other measure or fact may be very closely similar to our test, 
or it may be quite different. It may be obtained at about the same 
time our test is given, or it may not be available for a long time in 
the future. It is not possible to provide labels for all the variations 
of closeness in nature of the measure or of closeness in time. We will 
use three terms in the following way: 


1. Congruent validity will refer to evidence of validity obtained by 
correlating a test with an existing similar measure of the same func- 
tion. Thus, correlating a new intelligence test with already existing 
tests would provide evidence on congruent validity.

2. Concurrent validity will refer to evidence of validity obtained 
by relating the test to some other measure obtained at the same 
time. If a test devised to appraise sociability were correlated with 
ratings on sociability by close friends, this would provide evidence 
on the concurrent validity of the test. 

3. Predictive validity will be used to refer to the validity of a test 
or other measuring instrument when it is related to some criterion 
of performance or success that becomes available in the future and 
is quite different from the test itself. Thus, when a scholastic apti- 
tude test given to high-school seniors is correlated with college fresh- 
man grades, evidence is being obtained on its predictive validity. 


Let us see how these several types of evidence on validity might 
apply in the case of a specific test. Suppose that we have under- 
taken to prepare, for use in high school, a test of “Correctness 
and Effectiveness of Written Expression." The test is designed to 
measure these objectives of English as it is taught in our schools. 
We wish to prepare as valid a test as possible and to be able to report
the evidence on the validity of the test after it has been completed. 


CONTENT VALIDITY 


How shall we judge whether our test is a valid test of correctness
of written expression? One thing we can do is examine the content
of the test, or of the plan for the test if it is not yet made, and see
how well this content matches what the school has been trying to
teach. If we were to make a catalogue of the mechanics of expression
we have been trying to teach, we could prepare a list of situations
in which the individual must capitalize correctly, situations in which
he must select suitable punctuation, and situations in which he
must select the correct word order or word form: case of noun or
pronoun, tense or number of verb. The catalogue would organize
itself into groupings like those just indicated, would include many
specifics within each, and should probably include an importance
rating for each element of usage. A complete catalogue would
provide the background for judging the content validity of this part
of our test. If the test included a representative sample of the more
important usages, it would have high content validity.
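
The comparison of a test against such a catalogue can be laid out as
simple bookkeeping. The sketch below is purely illustrative: the
three groupings follow the text, but the importance weights and item
counts are hypothetical, and the function name is our own. It shows
only where a test over- or under-samples an area. Whether the
sampling is adequate remains a matter of judgment.

```python
# Hypothetical importance weights from a catalogue of usages (they sum to 1).
catalogue_weights = {
    "capitalization": 0.20,
    "punctuation": 0.35,
    "word order and word form": 0.45,
}

# Hypothetical number of test items written for each grouping.
test_item_counts = {
    "capitalization": 10,
    "punctuation": 14,
    "word order and word form": 16,
}


def coverage_gaps(weights, item_counts):
    """Share of test items in each area minus that area's importance weight.

    A positive value means the area is over-represented in the test,
    a negative value that it is under-represented.
    """
    total_items = sum(item_counts.values())
    return {
        area: item_counts.get(area, 0) / total_items - weight
        for area, weight in weights.items()
    }


for area, gap in coverage_gaps(catalogue_weights, test_item_counts).items():
    print(f"{area:28s} {gap:+.2f}")
```
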

How can we prepare a complete catalogue and establish the weight
or importance of each element? If we are interested only in our own
class or school, the list should be based on what we are actually
teaching in that class or school. We should then analyze our own
course of study, our own text, or our own personal objectives. Or if
the test relates to the whole school, a group of teachers can cooperate
on the analysis. We decide what weights the different parts should
have by our own best judgment or by the pooled judgment of the
group making the content analysis of the course of study. But if
we are trying to prepare a test to be used in many schools throughout
the country, we must then broaden the base of our content analysis.
We should get the judgment of qualified experts as to the relative
emphasis to be given to different aspects of usage and the importance
of specific usages.

Sometimes our analysis of content may go beyond specific school
teaching. Thus, if we wished our usage test to have content validity
beyond the specific criterion of what had been taught in school,
we might study children's letters and compositions to see what
words or constructions they actually used and which ones led to
frequent errors. This type of analysis of use and of difficulty has
been carried out from time to time as a way of planning content
for the curriculum. It is equally appropriate for guiding the
construction of a test.
When a test is a test on a job rather than on a course of instruction,
the content analysis can appropriately be directed at the job. Thus,
if one were preparing a certifying examination for junior accountant,
one would ask what a junior accountant is required to know or do,
how much he must do of each activity that has been isolated, and
how critical the knowledge or skill is for the job. A proficiency
test with high content validity would be one that sampled in a
balanced way those essential knowledges and skills.

The basic issues are the same whether one is making a test or
evaluating one. When the test is being constructed, the objective
is to make it conform as closely as possible to the important content
of the course, activity, or job. When an already-made test is being
evaluated, the problem is to judge how well it does correspond. The
content with which it may be compared may be (1) the content of
a particular local text or course of study, (2) the common content
of a number of texts or courses of study, (3) the judgment of experts
as to what should be emphasized in a course of study, (4) the
activities the individual carries out or the errors he makes in the
general activities of life, or (5) the knowledges and skills that must
be displayed in a particular job. How closely the test does in fact
correspond is, in the last analysis, a matter of critical examination
and professional judgment.


CONCEPT OR CONSTRUCT VALIDITY


What about the concept validity or construct validity of our test?
Let us turn our attention now to the part of our illustrative test
having to do with effective expression. But what is "effective
expression"? What does this phrase, this concept, really mean? Again,
we are thrown back on rational analysis, but this time we are trying
to analyze a concept and see what is implied by it rather than to
make a catalogue of content. The concept "effective expression"
is broad, abstract, and indefinite. Test items must be specific,
concrete, and precise. They must consist of definite limited tasks. The
problem of preparing a test that has concept or construct validity is
that of bridging the gap from broad general concept to specific
tangible tasks or test items.

A first level of analysis of "effectiveness of expression" might
identify some such components as the following:

1. Selection of ideas to be presented: ideas that are interesting or
   important or that fit together well.

2. Organization of ideas for presentation.

   A. Arrangement in logical or effective order.

   B. Subordination of details to main ideas.

3. Paragraphing: use of paragraphs to bring out the organization of ideas.

4. Writing effective sentences.

   A. Having each sentence convey a single complete idea.

   B. Variety of sentence style and length.

5. Effective use of words.

   A. Selecting the precise word for the meaning.

   B. Variety in words used.

   C. Choice of interesting figures of speech.

6. Adaptation of style to message: exposition, narration, etc.

7. Adaptation of form to audience: in style and word choice.


This analysis is neither complete nor authoritative. However, it
shows the sorts of categories we develop when we start to analyze a
general concept into measurable components. For us to develop a
valid test of effectiveness of expression, we would have to carry out
the type of analysis suggested by the above table, and then we
would have to take the further step of translating the components
into tasks that could be incorporated into a test. Thus 2A,
"Arrangement of ideas in a logical order," might be tested by giving
the examinee 6 or 8 phrases representing points in an exposition of
some topic (e.g., "Building a Rabbit Hutch") and instructing him
to arrange them in the order in which they would appear in his
writing. A similar translation into test tasks would be needed for
each of the other points. If we have shown wisdom in our analysis of
the general function into its specific components and ingenuity in
devising test tasks for the components, we will have achieved concept
or construct validity for our test of effectiveness of expression.

Frequently this type of analysis of the crucial concepts is the key
to preparing a valid test or to appraising the validity of one that
has already been constructed. We encounter such concepts as
"scientific thinking," "fairmindedness," "good citizenship,"
"rigidity," or even "reading comprehension." Before we can make
progress with the task of measurement, we must analyze these global
and often fuzzy concepts into their behavioral components. It is
against this analysis that we must check our test or someone else's to
judge whether it has construct validity. This analysis differs from
that discussed in the previous section in that analysis and evaluation
are now concerned not with content or subject matter acted upon but
with the functions or processes that are applied to some content.

We may, of course, get help with this analysis. We can recruit a
committee of experts to help us make it. Often we can find analyses
of a particular concept that have already been made by some
previous group of experts. Thus, the yearbooks put out by the various
departments of the National Education Association and of the
National Society for the Study of Education contain many such
analyses of the objectives of science education, the make-up of
efficient reading skill at different grade levels, or the nature of
"critical thinking." Books on the teaching of a particular subject will
often have somewhat the same type of analysis. Here, we place our
faith in the group that has already carried out the analysis. This is
another type of rational and judgmental process. In each instance, the
final test of the validity of the instrument being studied is a
judgment as to how adequately the tasks included in the test represent
a translation of the basic concept into samples of behavior.


In practice, establishing the content and the construct validity 
of a test are often closely interwoven. Thus, the same steering 
committee that judges the importance of different items of content 
may undertake to translate the underlying concepts (often relating 
to particular processes to be appraised) into manageable aspects of 
behavior for testing. The two types of judgments may be made 
at the same time and by the same people. They represent, after all, 
two closely related aspects of the rational design or appraisal of a 
test. The preparation of a "blueprint" for a test, as described in 
Chapter 3 (p. 32), involves just this type of analysis of the content
to be covered and the functions to be measured. 


CONGRUENT VALIDITY

Let us turn now to the evidence that statistical procedures can
provide on the validity of a test. We spoke first of congruent
validity, in which our new test is correlated with an existing similar
measure. Our hypothetical test of correctness and effectiveness of
written expression might be given to a group that had also been tested
with the Effectiveness of Expression Test distributed by the
Cooperative Test Division of the Educational Testing Service. The
correlation of our test with the existing test would show to what
extent the two were measuring the same ability. If we are willing to
accept the Cooperative Test as having good validity, this correlation
provides some evidence on the validity of our test.

It can be seen that the type of evidence just proposed is somewhat
circular. If it is to carry any real weight, we must already know a
great deal about the validity of tests very similar to the one we have
just produced. Then a high correlation does provide some evidence
that our new test perpetuates whatever validity was in the earlier
measures. This evidence can be worthwhile as an initial evaluation
of a new test in a well-studied field, as for example the field of
scholastic aptitude or general intelligence.


A somewhat different and somewhat more convincing indication 
of validity, which we can perhaps consider under this same label, is 
the responsiveness of our test to the manipulation of experimental 
conditions. For example, a measure of visual function called flicker 
fusion rate * had been proposed as an indicator of anxiety. The 
hypothesis was that for the more anxious person the disc would blur 
or fuse at a slower rotation rate than for the less anxious one. As 
one test of this hypothesis, flicker fusion rate was tested in a group 


* The speed at which a disc made up of black and white sectors must be rotated 
to be seen as a uniform gray. 




of persons before and after undergoing minor surgery. Before sur- 
gery the subject was presumed to be more anxious. Results were in 
accord with expectation. Flicker fusion rates were lower before sur- 
gery than after. The test performed in a way congruent with the 
hypothesis, and this may be taken as evidence supporting its validity 
as an indicator of the function for which it was proposed. 


CONCURRENT VALIDITY 

Evidence on validity may be obtained from the relationship to 
other currently obtainable information about the individual. For 
our "Correctness and Effectiveness of Written Expression Test," 
we might assemble an extensive sample of each child's free writing. 
We might make an analysis of the errors in this free writing and give 
each child a score for correctness of written expression. We might 
also have representative samples of each child's writing rated on 
effectiveness of expression by experienced teachers. These con- 
current measures, rather different in nature from our test, would 
provide a basis for evaluating the test. How good a basis would 
depend upon the faith that we had in the outside measures. If we 
felt thoroughly satisfied with them, then they would provide a good 
reference point against which to evaluate the test. 

Often, there are measures that appear more obviously valid than 
a standardized test or other objective procedure, but are so time- 
consuming that they are not practical for routine use. The complete 
analysis of pupil errors might be one of this sort. Such measures 
can often be used in research studies to give evidence on the validity 
of a convenient and accurate objective test. 

To give an illustration of a somewhat different kind of concurrent 
validity, let us suppose that we had given our test, "Correctness and 
Effectiveness of Written Expression," to all the editorial workers 
for a publishing company and had at the same time gotten ratings 
from their supervisors on how satisfactory they were in their work. 
The correlation of test scores with ratings would provide evidence 
on the concurrent validity of the test as far as that job was con- 
cerned. 

Note that none of the evidence considered in this section gives 
any guarantee that the test will predict future success. Even if the 
"Expression Test" correlated with job success for the editorial 
workers, this might mean no more than that a good editorial worker 
is one who has learned English skills. We could not be sure that 
presence or absence of the skills in advance of employment would 
serve as an indicator of success that would be achieved later. 




PREDICTIVE VALIDITY 

Finally, we may be interested in appraising the validity of our 
test as a predictor of later success in school, on the job, or in life. 
Then the effectiveness of our test procedure will be judged by the 
accuracy with which test scores predict a suitable measure of later 


[Fig. 6.1. Bar chart of the per cent failing pilot training at each 
aptitude rating. The correlation coefficient for these data is .49.] 

Aptitude rating    Per cent failing    Number of cases 
       9                  5.5                 722 
       8              [illegible]         [illegible] 
       7                 18.8                1274 
       6                 27.2                1701 
       5                 36.3                1877 
       4                 47.6                1707 
       3                 57.7                1043 
       2                 69.4                 553 
       1                 82.4                 250 


success. This later measure is called a criterion measure. The 
evidence of the effectiveness of our prediction is provided by the co- 
efficient of correlation between the test score and the later measure. 

This relationship can also be pictured in various ways. For 
example, the bar chart in Fig. 6.1 above shows the percentage of 
persons failing pilot training at each of nine score levels on a pre- 
dictor test battery. Examination of the chart shows a steady in- 
crease in the per cent failing training as one goes from the high to 
the low scores. The relationship pictured in this chart corresponds 
to a correlation coefficient of 0.49. 



The Problem of the Criterion. We said above that predictive valid- 
ity can be estimated by determining the correlation between test 
scores and a suitable criterion measure of success on the job. The 
joker here is the phrase "suitable criterion measure." One of the 
most difficult problems that the personnel psychologist or educator 
faces is that of locating or creating a satisfactory measure of job 
success to serve as a criterion measure for test validation. It may 
appear to the student that it should be a simple matter to decide 
upon some measure of rate of production or some type of rating by 
superiors. It may also seem that this measure, once decided upon, 
should be obtainable in an easy and straightforward fashion. Un- 
fortunately, this is not so. Finding or developing acceptable criterion 
measures usually involves the research worker in the field of tests 
and measurements in a number of troublesome problems. 

Difficulties in obtaining satisfactory criterion measures arise from 
a variety of sources. There are many types of jobs that yield no 
objective record of performance or production, as, for example, 
that of private secretary, for which we might be interested in using 
our test of effectiveness of expression. But even when such records 
are available, they are often influenced by a variety of factors outside 
the worker's control. Thus, the production record of a weaver may 
depend not only upon his own skill in threading or adjusting the 
loom but also on the condition of the equipment, the adequacy of 
the lighting where he must work, or the color of the thread he must 
weave. The sales of an insurance agent are not only a function of 
his own effectiveness as a salesman but also of the territory in which 
he must work and the supervision and assistance he receives. The 
problems of effective rating of personnel are discussed in detail in 
Chapter 13. It suffices to indicate here that ratings are often un- 
stable and influenced by many factors other than the proficiency of 
the person being rated. 

There are always many criterion measures that might be obtained 
and used for validating a selection test. In addition to quantitative 
performance records and subjective ratings, which have already been 
mentioned, one might use later tests of proficiency. This is the type 
of situation that is involved when a college entrance mathematics 
test is validated in terms of its ability to predict later performance 
on a comprehensive examination on college mathematics. Here 
the comprehensive examination serves as the criterion measure. 
Another common type of criterion is grades in some type of educa- 
tional or training program. Thus, tests for the selection of engineers 
may be validated against course grades in engineering school. 




All criterion measures are only partial in that they measure only 
a part of success on the job or only the preliminaries to actual job 
performance. This last is true of the engineering school grades men- 
tioned above. They represent a relatively immediate but quite 
partial criterion of success as an engineer. The ultimate criterion is 
some appraisal of the man's lifetime success in his profession. In 
the very nature of things, such an ultimate criterion is inaccessible 
to us and we must be satisfied with substitutes for it. These substi- 
tutes are only partial and are never completely satisfactory. Our 
problem is always to choose the most satisfactory from among the 
measures that it appears feasible to obtain. We are faced, then, with 
the problem of deciding which of several criterion measures is most 
satisfactory. How shall we arrive at this decision? 
Qualities Desired in a Criterion Measure. There are four qualities 
that we shall desire in a criterion measure. In order of their impor- 
tance they are (1) relevance, (2) freedom from bias, (3) reliability, 
and (4) availability. 

We judge a criterion to be relevant in so far as score on the criterion 
measure is determined by the same factors that determine success 
on the job. In appraising the relevance of a criterion, we are thrown 
back once more upon rational considerations. There is no empirical 
evidence that will tell us whether a particular criterion measure is 
or is not relevant. For achievement tests we found it necessary to 
rely upon the best available professional judgment to determine 
whether the content of the test was what it should have been. In 
the same way, with respect to a criterion measure it is also necessary 
to rely upon professional judgment to provide the appraisal of the 
degree to which any available partial criterion measure is relevant 
to the ultimate criterion of job success. 

A second factor important in a criterion measure is that of freedom 
from bias. By this we mean that the measure should provide each 
person with the same opportunity to make a good score. Examples 
of biasing factors are such things as the variation in wealth from one 
district to another in our previous example of the insurance salesman, 
variation in the quality of equipment and conditions of work of a 
factory worker, variation in generosity of the bosses rating private 
secretaries, or variation in the skill of teachers instructing pupils in 
different classes. One will find it difficult to get meaning 
from the relationship of test results to a criterion score if that score 
depends upon factors in the conditions of work rather than factors 
in the individual worker. 

The topic of reliability will be discussed in general terms later in 




this chapter. As it applies to the criterion scores, the problem is 
merely this: a measure of success on the job must be stable or repro- 
ducible if it is to be predicted by any type of test device. If the 
criterion performance is one that jumps around from day to day so 
that the person who shows high job performance one week may show 
low job performance the next, then there is no possibility of finding 
a test that will predict it. A measure that is fundamentally unstable 
itself cannot be predicted by anything else. 

Finally, in the choice of criterion measure one always encounters 
practical problems of convenience and availability. How long is it 
going to take to get a criterion score for each individual? How much 
is it going to cost? Though a personnel research program can often 
afford to spend a substantial part of its effort in getting good criterion 
data, there is always a practical limit. Any choice of a criterion 
measure must take account of this practical limit. 


THE INTERPRETATION OF VALIDITY COEFFICIENTS 

Suppose that we have gathered test and criterion scores for a 
group of individuals and computed the correlation between them. 
Perhaps our predictor is a scholastic aptitude test, and the criterion 
is an average of college freshman grades. How shall we now decide 
whether the test is a good predictor? 

Obviously, other things being equal, the higher the correlation, 
the better. In one sense, our only basis for evaluating a predictor 
is in relation to other possible prediction procedures. Does test 
A yield a higher or lower validity coefficient than other tests? Than 
other types of information, such as high-school grades or rating by 
school principals? We will look with favor on any measure whose 
validity for a particular criterion is higher than that of measures 


previously available to us. 

Some representative validity coefficients are exhibited in Table 
6.1. These give a little general picture of the size of correlation that 
has been obtained in previous work of different kinds. The investi- 
gator concerned with a particular course of study or a particular job 
criterion will, of course, need to become intimately acquainted with 
validities found for his particular criterion. 

The usefulness of a test as a predictor depends not only on how 
well it correlates with a criterion, but also on how much new in- 
formation it gives. Thus, the Differential Aptitude Tests’ Verbal Rea- 
soning Test correlates on the average 0.48 with high-school English 
grades, and a test of sentence usage correlates 0.51 with the same 


grades. But the two tests have an intercorrelation of 0.62. They 




Table 6.1. Validity of Selected Tests as Predictors of Certain Educational 
and Vocational Criteria 

                                                                    Validity 
Predictor Test                Criterion Variable                   Coefficient 
Pintner General Ability Test  Metropolitan Achievement - 
                                Reading Comp. (Gr. 5)                  .16 
                              Metropolitan Achievement - 
                                Total Score (Gr. 5)                    .84 
ACE Psychological Exam -      College Grades - English                 .48 
  Score                       College Grades - Math                [illegible] 
                              College Grades - Art                     .24 
Seashore Tonal Memory Test    Performance test on stringed 
                                instrument                             .28 
Short Employment Test 
  Word Knowledge Score        Production index - 80 bookkeeping 
                                machine operators                      .10 
  Word Knowledge Score        Job grade - 106 stenographers        [illegible] 
  Arithmetic Skill Score      Production index - 80 bookkeeping 
                                machine operators                      .26 
  Arithmetic Skill Score      Job grade - 106 stenographers            .00 
Differential Aptitude Tests 
  Verbal Reasoning            English grades 3½ years later        [illegible] 
  Space Relations             English grades 3½ years later        [illegible] 
  Mechanical Reasoning        English grades 3½ years later        [illegible] 


overlap and, in part at least, the information each test provides is 
the same as that provided by the other test. The net result is that 
pooling the two tests can give a validity coefficient of no more than 
0.55. If the two tests were uncorrelated, each giving evidence com- 
pletely independent of the other, the combination of the two tests 
would give a validity coefficient of 0.70. 
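
The two figures just quoted can be checked with the standard formula for the multiple correlation of a criterion with two optimally weighted predictors. The formula is not given in the text, which leaves the computation to statistics texts, so the sketch below should be read as one conventional way of reproducing the values; the function name is only illustrative.

```python
from math import sqrt

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with the best-weighted
    combination of two predictors.

    r1, r2 : validity coefficient of each predictor
    r12    : correlation between the two predictors
    """
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return sqrt(r_squared)

# Verbal Reasoning (0.48) and sentence usage (0.51), intercorrelated 0.62
overlapping = multiple_r(0.48, 0.51, 0.62)

# The same two validities if the tests were uncorrelated with each other
independent = multiple_r(0.48, 0.51, 0.00)

print(f"{overlapping:.2f} {independent:.2f}")  # 0.55 0.70
```

The overlap between the predictors is what holds the combined validity down: with an intercorrelation of 0.62 the second test adds only about 0.04 to the validity of the better single test.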

Clearly, the higher the correlation between a test or other pre- 
dictor and a criterion, the more pleased we shall be. But in addition 
to this relative standard, we should like some absolute one. How 
high must the validity coefficient be for the test to be useful? What 
is a "satisfactory" validity? This is a little bit like asking, "How 
high is up?" However, we can try to give some sort of answer. 

To an organization using a test as a basis for deciding whether to 
hire a particular job applicant or admit a particular student, the 


* Statistical procedures have been developed to determine the best weighting to 
give the two or more predictors and to calculate the correlation that will result 
from this combination. The procedures for computing the weights for the separate 
components (called regression weights) and the correlation (multiple correlation) 
resulting from them are beyond the scope of this discussion but will be found in 
standard statistics texts. 




significant question is: How much more often will we make the right 
decision on whom to hire or admit if we use this test than if we oper- 
ate on a purely chance basis or on the basis of some less valid meas- 
ure? The answer to this question depends in considerable measure 
on the proportion of individuals who must be accepted. A selection 
procedure can do much more for us if we need to accept only the 
individual who appears to be the best one in every ten applicants 
than if we must accept nine out of ten. However, to provide a 
specific example, let us assume that we will accept half of the appli- 
cants. We may then ask what percent of the ones we accept will 
fall in the upper half of the whole group in job success, i.e., in what 
per cent of our decisions do we make a "correct" choice? The per 
cent of correct choices that will result for correlations of different 
sizes is shown in Table 6.2. 

Table 6.2. Per Cent of Correct Assignments When 50 Per Cent of Group 
Must Be Selected 

    Validity of            Per Cent of 
Selection Procedure      Correct Choices 
        .00                   50.0 
        .20                   56.4 
        .40                   63.1 
        .50                   66.7 
        .60                   70.5 
        .70                   74.7 
        .80                   79.5 
        .90                   85.6 

Table 6.2 indicates that when the correlation is zero, the per cent 
of correct decisions is 50. This is exactly the chance value. Fifty 
per cent of our cases are defined as successes, i.e., as falling in the 
upper half of the total group, and if we had picked our students or 
employees by just flipping a coin, we could have been right 50 per 
cent of the time. The improvement in our "batting average" as the 
correlation goes up is shown in the table. Thus, for a correlation 
of 0.40 we will pick right 63.1 per cent of the time; with a correlation 
of 0.80 our percentage will be 79.5 per cent, and so forth. 

The table shows not only our accuracy for any given correlation 
but our gain in accuracy if we raise the validity of our predictor. 
Thus, if we were able to replace a predictor with a validity of 0.40 
by one with a validity of 0.60, we would increase our per cent of cor- 
rect decisions from 63.1 to 70.5. All these percentages refer, of 
course, to the ground rules set in the previous paragraph. However, 




Table 6.2 gives a fairly representative basis for understanding the 
effects of a selection program from the point of view of the employ- 
ing or certifying agency. 
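
The entries of Table 6.2 can be reproduced with a short calculation. The text does not say how the table was computed; the sketch below assumes that predictor and criterion follow a bivariate normal distribution and that both are split at the median, in which case the proportion of correct choices is 0.5 + arcsin(r)/π. Under that assumption the computed values agree with the table to one decimal place.

```python
from math import asin, pi

def percent_correct(r):
    """Per cent of selected persons who fall in the upper half on the
    criterion, when the upper half on the predictor is selected.

    Assumes a bivariate normal relationship with correlation r
    (an assumption made for this sketch, not stated in the text).
    """
    return 100 * (0.5 + asin(r) / pi)

for r in (0.00, 0.20, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90):
    print(f"{r:.2f}   {percent_correct(r):.1f}")
```

Note how slowly the percentage climbs at first: raising the validity from 0.00 to 0.40 gains about 13 points, and it takes a correlation of 0.90 to get only 85.6 per cent of the choices right.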

Another way of appraising the practical significance of a correla- 
tion coefficient, and one that is perhaps more meaningful from the 
point of view of the person being tested, is shown in Table 6.3. The 
rows in the little tables represent the fourths of a group of appli- 
cants, potential students or employees, with respect to a predictor 
test. The columns indicate the per cent of cases falling in each fourth 
on the criterion score. Look at the little table in Table 6.3 corre- 
sponding to a validity coefficient of 0.50. We see that of those who 
fall in the lowest fourth on our predictor 480 out of 1000 or 48.0 
per cent fall in the lowest fourth on the criterion score, 27.9 per 
cent in the next lowest fourth, 16.8 per cent in the next to highest 
fourth, and 7.3 per cent in the highest fourth. The diagonal entries 
represent cases that fall in the same fourth on both predictor and 
criterion. The further we get from the diagonal, the greater the 
discrepancy between prediction and outcome. 
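
One way to get a feeling for where entries like these come from is to simulate a large group of applicants. The sketch below is only an approximation under an assumption not stated in the text: predictor and criterion scores are drawn from a bivariate normal distribution with the given correlation, and the fourths are cut at the theoretical quartile points. The sample size, seed, and function names are arbitrary choices for the illustration.

```python
import random
from math import sqrt

QUARTILE_CUT = 0.6745  # approximate upper quartile point of the standard normal

def fourth(z):
    """Return 0 for the lowest fourth up to 3 for the highest fourth."""
    if z < -QUARTILE_CUT:
        return 0
    if z < 0:
        return 1
    if z < QUARTILE_CUT:
        return 2
    return 3

def simulate_row(r, predictor_fourth, n=200_000, seed=0):
    """Cases per 1000 in each criterion fourth (lowest first) among
    persons who fall in the given fourth on the predictor."""
    rng = random.Random(seed)
    counts = [0, 0, 0, 0]
    for _ in range(n):
        x = rng.gauss(0, 1)                                   # predictor score
        y = r * x + sqrt(1 - r * r) * rng.gauss(0, 1)         # criterion score
        if fourth(x) == predictor_fourth:
            counts[fourth(y)] += 1
    total = sum(counts)
    return [round(1000 * c / total) for c in counts]

# Lowest fourth on the predictor, validity 0.50; the result should land
# close to the 480, 279, 168, 73 quoted in the text.
lowest_row = simulate_row(0.50, predictor_fourth=0)
print(lowest_row)
```

Because the figures come from random sampling, they will differ from the tabled values by a few cases per thousand.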

This table emphasizes not so much the gain from using the pre- 
dictor test as the variation in job success of those who are similar in 
predictor scores. From the point of view of schools or employers, 
the important thing is the improved percentage of accuracy illus- 


trated in Table 6.2. Dealing in large numbers, they can count on 
gaining from any predictor that is more valid than the procedure 
currently in use. From the point of view of the single individual, 
the many marked discrepancies between predicted and actual suc- 
cess shown in Table 6.3 may seem at least as important. If he has 
done poorly on the test, he may be less impressed by the fact that 
the probability is that he will be below average on the job than by 
the fact that he may do very well. He may always be the exception. 

One further point can well be emphasized in conclusion. Validity 
is always specific to a particular curriculum or a particular job. When 
an author or publisher claims that his test is valid, it is always ap- 
propriate to ask: Valid for what? A test in social studies that ac- 
curately represents the content and objectives of one program of 
instruction may be quite inappropriate for the program in a differ- 
ent community. The test must always be evaluated against the 
objectives of a specific program of instruction. By the same token, 
a test that is a useful predictor of speed in learning to fly a light plane 
may have no value in relation to instrument flying in a transport 
plane or heavy bomber. Validity must always be evaluated in re- 
lation to the specific situation in which a measure is to be used. 




Table 6.3. Accuracy of Prediction for Different Values of the 
Correlation Coefficient 

(1000 cases in each row or column) 

              r = .00                              r = .60 
Quarter on    Quarter on Criterion   Quarter on    Quarter on Criterion 
Predictor     4th  3rd  2nd  1st     Predictor     4th  3rd  2nd  1st 
1st           250  250  250  250     1st            45  141  277  537 
2nd           250  250  250  250     2nd           141  264  318  277 
3rd           250  250  250  250     3rd           277  318  264  141 
4th           250  250  250  250     4th           537  277  141   45 

              r = .40                              r = .70 
Quarter on    Quarter on Criterion   Quarter on    Quarter on Criterion 
Predictor     4th  3rd  2nd  1st     Predictor     4th  3rd  2nd  1st 
1st           104  191  277  428     1st            22  107  270  601 
2nd           191  255  277  277     2nd           107  270  353  270 
3rd           277  277  255  191     3rd           270  353  270  107 
4th           428  277  191  104     4th           601  270  107   22 

              r = .50                              r = .80 
Quarter on    Quarter on Criterion   Quarter on    Quarter on Criterion 
Predictor     4th  3rd  2nd  1st     Predictor     4th  3rd  2nd  1st 
1st            73  168  279  480     1st             6   66  253  675 
2nd           168  268  295  279     2nd            66  271  410  253 
3rd           279  295  268  168     3rd           253  410  271   66 
4th           480  279  168   73     4th           675  253   66    6 


RELIABILITY 


The second question we raise with respect to a measurement pro- 
cedure is: How reliable is it? We are now asking not what it meas- 
ures but how accurately it measures whatever it does measure. What 
is the precision of our resulting score? How accurately will it be 
reproduced if we measure the individual again? 

Suppose you were to weigh each child in a school class today and 
the school nurse were to weigh each child tomorrow on a good pair 
of beam scales. You would not agree perfectly. The two weights 
recorded for a child would differ somewhat in some cases. The 




be tomorrow, or next week, or next month? Both are sensible ques- 
tions. But they are not the same question. The data we must 
gather to answer one are different from the data we shall need to 
answer the other. 

To study the reliability of such a physical characteristic of a per- 
son as weight, repetition of the measurement is a straightforward 
and satisfactory operation. It appears satisfactory and applicable 
also with some simple types of behavior, such as speed of reaction 
or muscular strength. But suppose now we are interested in the re- 
liability of a test of reading comprehension. Let us assume that 
the test is made up of six reading passages with ten questions on 
each. We administer the test once and then immediately administer 
it again. What happens? Certainly, the child is not going to have 
to reread all the material he has just read. He may do so in part, 
but to a considerable extent his answers the second time will involve 
merely remembering what answer he had chosen the time before 
and marking it again. If he had not been able to finish the first time, 
he will now be able to work ahead and spend most of his time on 
new material. These same effects will hold true to some degree 
even over a longer period of time. 


Clearly, this sort of test given a 
second time does not present the same task that it did the first time. 

There is a second consideration entering into the repetition of 
such a test as a reading comprehension test. Suppose that one of 
the five passages in the test was about baseball and that a particular 
boy was an expert on baseball. The passage would then be especially 
easy for him, and he would in effect get a bonus of several points. 
The test would overestimate his general level of reading ability. But 
note that it would do it consistently on both testings because the 
material remains the same. The error for individual S is a constant 
error in the two testings. Since it affects both his scores in the same 
way, it makes the test look reliable rather than unreliable. 

In such an area of ability as reading, we must recognize the possi- 
bility that an individual does not perform uniformly well through- 
out the whole area. His specific interests, experiences, and back- 
ground give him strengths and weaknesses. A particular test is one 
sample from the whole area. How well individual S does on the test, 
relative to others, is likely to depend in some degree upon the par- 
ticular sample of tasks chosen to represent the area of ability or per- 
sonality we are trying to appraise. If the sample remains the same 
for both measurements, his behavior will stay more nearly the same 
than if the sample of tasks is varied. 

Note that so far we have identified three main sources of varia- 
tion in performance that will tend to reduce the precision of a par- 
ticular score as a description of an individual: 


1. Variation in response to the test at a particular moment in 
time. 

2. Variation in the individual from time to time. 

3. Variation arising out of the particular sample of tasks chosen 
to represent an area of behavior. 
Retesting the individual with identically the same test can be ar- 
ranged to reflect the first two types of "error," but this procedure 
cannot evaluate the effects of the third type. In addition, there 
may be the memory and practice effects to which we referred above. 


PARALLEL TEST FORMS 

Concern about this third source of variation, variation arising be- 
cause of the necessity of choosing a particular sample of tasks to 
represent a whole area of behavior, leads us to another set of proce- 
dures for evaluating reliability. If the sampling of items may be 
a significant source of "error," and if, as is usually the case, we want 
to know with what accuracy we may generalize from the specific 
score to the area of behavior it is supposed to represent, we must 
develop some procedures that take account of this variation due to 
the sample of tasks. We may do this by correlating two equivalent 
forms of a test. 

Equivalent forms of a test should be thought of as forms built 
according to the same specifications but otherwise representing sepa- 
rate samples of behavior in the defined area. Thus, two equivalent 
reading tests should contain reading passages and questions of the 
same difficulty. The same sorts of questions should be asked, i.e., 
the same balance of specific fact and general idea questions. The 
same types of passages should be represented, i.e., expository, argu- 
mentative, esthetic. But the specific passages and questions should 
be different. 

If we have two forms of a test, we may give each pupil first one 
form and then the other. They may follow each other immediately 
if we are not interested in stability over time, or may be separated 
by an interval if we are. The correlation between the two forms 
will provide an appropriate reliability coefficient. If a time interval 
has been allowed between the testings, all three of our sources of 
variation will have had a chance to get in their effects: variation 
arising from the measurement itself, variation in the individual over 
time, and variation due to the sample of tasks. 




To ask that a test yield consistent results under these conditions 
is the most rigorous standard we can set for it. And if we want to 
use our test results to generalize about what Johnny will do on other 
tasks of this general sort next week and next month, then this is the 
appropriate standard by which to evaluate a test. For most educa- 
tional situations, this is the way we want to use test results, and so 
evidence based on equivalent test forms should usually be given 
the most weight in evaluating the reliability of a test. 

The use of two parallel test forms provides a very sound basis for 
estimating the precision of a psychological or educational test. This 
procedure does, however, raise some practical problems. It de- 
mands that two parallel forms of a test be available and that time 
be allowed for administering two separate tests. Sometimes no sec- 
ond form of a test exists, or no time can be found for a second testing. 
To administer a second separate test is often likely to represent a 
somewhat burdensome demand upon available resources. These 
practical considerations of convenience and expediency have made 
test makers receptive to procedures that extract an estimate of re- 
liability from administration of only one form of a test. However, 
such procedures are compromises at best. The correlation between 
two parallel forms, usually administered with a lapse of several days 
or weeks in between, represents the preferred procedure for estimat- 
ing reliability. 


SUBDIVIDED TEST 


The most widely used procedure for estimating reliability from a 
single testing divides a particular test up into two presumably equiv- 
alent halves. The half-tests may be assembled on the basis of care- 
ful examination of the content and difficulty of each item, making 
a systematic effort to balance out the content and difficulty level of 
the two halves. A simpler procedure, which is often relied upon to 
give equivalent halves, is to put alternate items into the two half- 
tests, that is, to put all the odd-numbered items in one half-test and 
all the even-numbered items in the other. This is usually a sensible 
procedure, since items of similar form, content, or difficulty are likely 
to be grouped together in a test. For a reasonably long test, say, of 
60 items or more, splitting the test up in this way will tend to balance 
out factors of item form, content covered, and difficulty level. The 
two half-tests will have a good probability of constituting "equiv- 
alent" tests, as these are defined in the preceding section. 

The procedures we are discussing now divide the test in half only 
for scoring, not for administration. That is, a single test is given 
at a single sitting and with a single time limit. However, two sepa- 
rate scores are derived: one by scoring the odd-numbered items and 
one by scoring the even-numbered items. The correlation between 
these two scores provides a measure of the accuracy with which the 
test is measuring the individual. 

However, it must be noted that the computed correlation is be- 
tween two half-length tests. This value is not directly applicable 
to the full-length test, which is the actual instrument prepared for 
use. In general, the larger the sample of a person's behavior we have, 
the more reliable the measure will be. The more behavior we record, 
the less our measure will depend upon chance elements in behavior 
of the individual or in the particular sampling of tasks. Single lucky 
answers or momentary lapses of attention will be more nearly evened 
out. 

Where the two halves of the test, which gave the scores actually
correlated, are equivalent, we can get an unbiased estimate of
total-test reliability from the correlation between the two
half-tests. This estimate is given by the formula

$$r_{11} = \frac{2\,r_{\frac{1}{2}\frac{1}{2}}}{1 + r_{\frac{1}{2}\frac{1}{2}}} \qquad (1)$$

where $r_{11}$ is the estimated reliability of the full-length test.
      $r_{\frac{1}{2}\frac{1}{2}}$ is the actual correlation between two half-length tests.


Thus, if the correlation between the two halves of a test is 0.60,
formula 1 would give

$$r_{11} = \frac{2(0.60)}{1 + 0.60} = \frac{1.20}{1.60} = 0.75$$

This formula, referred to generally as the Spearman-Brown Prophecy
Formula from the names of its originators and function, makes
it possible for us to compute an estimate of reliability from a single
administration of a single test.
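
The odd-even procedure and formula 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small set of item scores are hypothetical and are not taken from any test manual.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_scores(item_scores):
    """Score one examinee's items (1 = right, 0 = wrong) as two half-tests."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd, even


def spearman_brown_full_length(r_half):
    """Formula 1: full-length reliability estimated from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Hypothetical item scores for six examinees on an eight-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
halves = [odd_even_scores(person) for person in responses]
r_half = pearson_r([h[0] for h in halves], [h[1] for h in halves])
print("half-test correlation:", round(r_half, 2))
print("estimated full-length reliability:",
      round(spearman_brown_full_length(r_half), 2))
```

With so few examinees the result is only a demonstration of the steps; a real estimate would rest on a much larger group.
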

The appealing convenience of the split-half procedure has led to
its wide use. Many test manuals will be found to report this type
of reliability coefficient and no other. Unfortunately, this coefficient
has several types of limitations, which we must now examine.

In the first place, when we have extracted two scores from a single
testing, both scores necessarily represent the individual as he is at
the same moment of time. Even events lasting only a few minutes
will affect both scores about equally. In other words, variation of
the individual from day to day cannot be reflected in this type of
reliability coefficient. It can only give evidence as to the precision
with which we can appraise him at a specific moment in time.

In the second place, a split-half reliability coefficient becomes
meaningless when a test is highly speeded. Suppose we have a test 
of simple arithmetic, made up of problems like 3 + 5 = ?, and that 
the test is being used with adults with a 2-minute time limit. We 
will get wide differences in score on such a test, but the differences 
will be primarily differences in speed. Errors will be a minor factor. 
The person who gets a score of 50 will very probably have attempted 
just 50 items, and of these 25 will be odd and 25 will be even. In 
other words, the two halves of the test will appear perfectly consist- 
ent, because opportunity to attempt items is automatically balanced 
out for the two half-tests. 
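
The point about speeded tests can be illustrated with a short Python sketch. It assumes, as in the arithmetic example, that every attempted item is answered correctly, so that a score depends only on how far the examinee got. The numbers of items attempted are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def half_scores_when_speed_decides(items_attempted):
    """Odd and even half scores when every attempted item is right."""
    odd = (items_attempted + 1) // 2   # items 1, 3, 5, ... among those reached
    even = items_attempted // 2        # items 2, 4, 6, ... among those reached
    return odd, even


# Hypothetical numbers of items reached in the time limit by six adults.
attempted = [30, 37, 42, 50, 55, 61]
halves = [half_scores_when_speed_decides(a) for a in attempted]
r_half = pearson_r([h[0] for h in halves], [h[1] for h in halves])
print("odd-even correlation on a pure speed test:", round(r_half, 3))
```

The two half scores can never differ by more than one point, so the correlation is close to 1.00 whatever the real consistency of the examinees' speed from one occasion to another.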

Few tests depend as completely upon speed as does the one that 
we have chosen to illustrate our point. However, many involve 
some degree of speeding. This speed factor will tend to inflate esti- 
mates of reliability based on the split-half procedure. The amount 
of overestimation will depend upon the degree to which the test is 
speeded, being greater for those tests in which speed plays a greater 
role. However, speed enters in sufficiently generally that split-half
estimates of reliability should always be discounted. Test users
should demand that commercial publishers provide reliability esti- 
mates based on parallel forms of the test. 


RELIABILITY ESTIMATED FROM ITEM STATISTICS 


The teacher or investigator who makes much use of tests and who
reads extensively in test manuals will encounter one other type of
procedure for estimating test reliability from a single test
administration. This procedure, also named for its originators, yields
what is referred to as a Kuder-Richardson reliability coefficient. The
essential assumption in the procedure is that the items within one
form of a test have as much in common with one another as do the
items in that one form with the corresponding items in a parallel or
equivalent form. This means that the items in a test are homogeneous
in the sense that every item measures the same general factors
of ability or personality as do the others. If this assumption
is sound, the Kuder-Richardson procedure leads to a reliability
estimate that has essentially the same interpretation as the odd-even
coefficient we have just considered. The Kuder-Richardson estimate
likewise (1) takes no account of variation in the individual from
time to time, and (2) is inappropriate for speeded tests. Within
these two limitations, it provides a conservative estimate of the
split-half type of reliability.*


COMPARISON OF METHODS 
A summary comparison of the different procedures for estimating
reliability is given in Table 6.4. This shows four factors that may
make a single test score an inaccurate picture of the individual's
usual performance. The table shows which of the factors are
represented in each of the procedures for estimating reliability we have
discussed. It can be seen that the different procedures are not
equivalent. Only administration of parallel test forms with a time
interval between permits all sources of variation to have their effects.
Each of the other methods masks some source of variation that may
be significant in the actual use of tests. Retesting with the same
identical test neglects variation arising out of the sample of items.
Whenever all the testing is done at one point in time, variation of
the individual from day to day is neglected. When the testing is
done as a unit with a single time limit, variation in speed of
responding is neglected. The facts brought out in this table should be
borne in mind in evaluating reliability data found in a test manual or
in the report of a research study.


Table 6.4. Sources of Variation Represented in Different Procedures for
Estimating Reliability

                                  Experimental Procedure for Estimating Reliability

                                Immediate  Retest     Parallel   Parallel   Odd-Even   Kuder-
                                Retest,    after      Test Form  Test Form  Halves     Richardson
                                Same       Interval,  without    with       of         Analysis,
                                Test       Same       Time       Time       Single     Single
Sources of Variation                       Test       Interval   Interval   Test       Test

How much the score can be ex-
pected to fluctuate owing to:
  Variations arising within the
    measurement procedure itself    X          X          X          X          X          X
  Changes in the individual from
    day to day                                 X                     X
  Changes in the specific sample
    of tasks                                              X          X          X          X
  Changes in the individual's
    speed of work                   X          X          X          X


* A widely used form of the Kuder-Richardson procedure (their Formula 20)
takes the form

$$r_{11} = \frac{n}{n-1}\cdot\frac{s_t^2 - \sum pq}{s_t^2}$$

where $r_{11}$ is the estimate of reliability.
      $n$ is the number of items in the test.
      $s_t$ is the standard deviation of the test.
      $\sum$ means "take the sum of" and covers the $n$ items.
      $p$ is the per cent passing a particular item (expressed as a decimal fraction).
      $q$ is the per cent failing the same item (expressed as a decimal fraction).

A formula involving simpler calculations (their Formula 21), which yields a
reasonably close approximation to the above, is

$$r_{11} = \frac{n}{n-1}\left[1 - \frac{M_t\left(1 - \frac{M_t}{n}\right)}{s_t^2}\right]$$

where $M_t$ is the mean score of the group and the other symbols have the same
meaning as given above.
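
The two Kuder-Richardson formulas given in the footnote can be sketched in Python as follows. The sketch assumes items scored 1 (passed) or 0 (failed) and uses the variance computed with N in the denominator; the function names and the tiny data set are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 from a list of examinees' 0/1 item scores."""
    n = len(responses[0])                       # number of items
    totals = [sum(person) for person in responses]
    var_t = pvariance(totals)                   # s_t squared
    sum_pq = 0.0
    for i in range(n):
        p = mean(person[i] for person in responses)  # proportion passing item i
        sum_pq += p * (1 - p)                        # q = 1 - p
    return (n / (n - 1)) * (var_t - sum_pq) / var_t


def kr21(n_items, mean_score, var_t):
    """Kuder-Richardson Formula 21, which needs only n, the mean, and s_t squared."""
    return (n_items / (n_items - 1)) * (
        1 - mean_score * (1 - mean_score / n_items) / var_t
    )


# Hypothetical scores of five examinees on a four-item test.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(person) for person in data]
print("KR-20:", round(kr20(data), 2))
print("KR-21:", round(kr21(4, mean(totals), pvariance(totals)), 2))
```

Formula 21 treats all items as equally difficult, which is why it needs no item-by-item tally.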


INTERPRETATION OF RELIABILITY DATA 


Analysis of data obtained for a general intelligence test for ele- 
mentary-school children has yielded a reliability coefficient of 0.85. 
How shall we interpret this result? What does it mean concerning 
the precision of an individual's score? Should we be pleased or dis- 
satisfied to get a coefficient of this size? 

We have already tried to give some content and meaning to cor- 
relation coefficients in Fig. 5.7 and in Tables 5.8, 6.1, 6.2, and 6.3. 
These have shown typical values of the correlation coefficient, the 
scatter of scores for representative correlations, and the accuracy 
of prediction with correlations of different sizes. A further contribu- 
tion to the interpretation of test reliability is found in the relation- 
ship between the reliability coefficient and the standard error of 
measurement. 

It will be remembered that the standard error of measurement is 
an estimate of the standard deviation that would be obtained for 
a series of measurements of the same individual. (It is assumed 
that he is not changed by being measured.) The standard error of
measurement can be calculated from the reliability coefficient by the
formula

$$s_m = s_t\sqrt{1 - r_{11}} \qquad (2)$$

where $s_m$ is the standard error of measurement.
      $s_t$ is the standard deviation of test scores.
      $r_{11}$ is the reliability coefficient.


Suppose that our test has a reliability of 0.85 and a standard
deviation of 15 points. Then we have

$$s_m = 15\sqrt{1 - 0.85} = 15\sqrt{0.15} = 5.8$$

In this instance, a set of measures of a particular person would have
a standard deviation of 5.8 points. Remember that a fairly constant
proportion of observations fall within any given number of standard
deviation units from the mean. Certain values for this relationship
were given in Table 5.6. This table shows that for a normal
curve 31.8 per cent of cases, or about 1 in 3, differ from the mean
by as much as 1 standard deviation; 4.6 per cent by as much as 2
standard deviations. Applying this to our case, in which the standard
deviation of our measurements is 5.8 points, we could say that
there is about 1 chance in 3 that a score that we get for an individual
differs from his "true" score by as much as 5.8 points (1 standard
error of measurement). There is about 1 chance in 20 that it differs
by as much as 11.6 points (2 standard errors of measurement).

The values shown above are fairly representative of what might
be found for intelligence quotients from one of the commercially
distributed group intelligence tests applied to children in the upper
elementary grades. Note that even with this relatively high
reliability coefficient, appreciable errors of measurement are possible in
at least a minority of cases. Shifts of 5 or 10 points of I.Q. can be
expected fairly frequently just because of errors of measurement.
Anyone who is impressed by and tries to interpret an I.Q. difference
of 5 points between two persons or two testings of the same person
has been fooled into thinking the test has a precision that it simply
does not possess. Further testing could perfectly well reverse the
result. Any test score or comparison of test scores must be made
with acute awareness of the standard error of measurement.

The manner in which the standard error of measurement is related
to the reliability coefficient is shown in Table 6.5. We see that the
magnitude of errors decreases as the reliability increases, but we also
see that errors of appreciable size will still be found even with
reliability coefficients of 0.90 or 0.95. In interpreting the score of a
particular individual, it is the standard error of measurement that
must be kept in mind. If we think of a range extending from 2
standard errors of measurement above the obtained score to 2 below,
we will have a band within which we can be reasonably sure (19
chances in 20) that the individual's true score lies. Thus, in the case
of the intelligence test described in previous paragraphs, we can think
of a test I.Q. of 90 as meaning rather surely an I.Q. lying between
about 80 and 100. If we think in those terms, we shall be much
more discreet in interpreting and using test results.


Table 6.5. Standard Error of Measurement for Different Values of
Reliability Coefficient

                   Standard Error of Measurement

Reliability        General
Coefficient        Expression        When S_t* = 10
    .50            0.71 S_t               7.1
    .60            0.63 S_t               6.3
    .70            0.55 S_t               5.5
    .80            0.45 S_t               4.5
    .85            0.39 S_t               3.9
    .90            0.32 S_t               3.2
    .95            0.22 S_t               2.2
    .98            0.14 S_t               1.4

* S_t signifies the standard deviation of the test.
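
Formula 2 and the band of 2 standard errors on either side of an obtained score can be sketched in Python. The function names are hypothetical; the figures fed in are those of the example in the text (standard deviation 15, reliability 0.85, obtained I.Q. 90) and the reliability values of Table 6.5.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Formula 2: s_m = s_t * sqrt(1 - r_11)."""
    return sd * math.sqrt(1 - reliability)


def score_band(obtained, sd, reliability, k=2):
    """Range from k standard errors below to k above the obtained score."""
    sem = standard_error_of_measurement(sd, reliability)
    return obtained - k * sem, obtained + k * sem


sem = standard_error_of_measurement(15, 0.85)
low, high = score_band(90, 15, 0.85)
print("standard error of measurement:", round(sem, 1))
print("band for an obtained I.Q. of 90:", round(low), "to", round(high))

# Standard error of measurement for a test with a standard deviation of 10.
for r in (0.50, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95, 0.98):
    print(r, round(standard_error_of_measurement(10, r), 1))
```

The band is a rough working rule resting on the normal-curve proportions cited in the text, not an exact statement about any one child.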

When interpreting the test score of an individual, it is desirable 
to think in terms of the standard error of measurement and to be 
somewhat humble and tentative in drawing conclusions from that 
test score. But for making comparisons between tests and for a num- 
ber of types of test analysis, the reliability coefficient will be more 
useful. Where measures are expressed in different units, as height in 
inches and weight in pounds, the reliability coefficient provides the 
only possible basis for comparison. Since the competing tests in a 
given field, such as primary reading, are likely to use types of scores 
that are not really comparable, the reliability coefficient will usually 
represent the only satisfactory basis for test comparison. Other 
things being equal, we shall prefer the test with the higher reliability 
coefficient, that is, the test that provides a more consistent ranking
of the individual within his group. 

The other things that may not be equal are primarily considerations
of validity and practicality. Validity, in so far as we can
appraise it, is the crucial test of a measurement procedure. Reliability
is important only as a necessary condition for a measure to have
validity. The ceiling for the possible validity of a test is set by its
reliability. A test must measure something before it can measure
what we want it to measure. A measuring device with a reliability
of 0.00 is reflecting nothing but chance factors. It does not correlate
with itself and cannot correlate with anything else. The theoretical
ceiling for the validity coefficient of a test (i.e., its correlation with
some criterion measure representing success in learning or on the
job) is the square root of its reliability coefficient. Thus, a test with
reliability coefficient of 0.36 could not give a validity coefficient
above 0.60, and one with a reliability coefficient of 0.64 could not
possibly yield a validity coefficient above 0.80. Only to the extent
that a test measures something accurately can it measure it validly.
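
The ceiling described above is a one-line calculation. A minimal Python sketch, with a hypothetical function name, is:

```python
import math


def validity_ceiling(reliability):
    """Theoretical upper limit of a validity coefficient: sqrt(reliability)."""
    return math.sqrt(reliability)


for r in (0.36, 0.64, 0.85):
    print("reliability", r, "-> validity cannot exceed", round(validity_ceiling(r), 2))
```

The ceiling is a limit only; an actual validity coefficient may fall far below it.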




The converse of the relationship we have just presented does not 
follow. A test may measure with the greatest precision and still 
have no validity for our purposes. Thus, we can measure head size 
with a good deal of accuracy, but the measure is still useless as an 
indicator of intelligence in first graders. Validity is something over 
and beyond mere accuracy of measurement. 

Considerations of cost, convenience, etc. may also sometimes lead 
to a decision to use a less reliable test. We may accept a less reliable 
40-minute test in preference to a more reliable 3-hour one because 
the 3 hours of testing time is too much of a burden in view of the 
purpose the test is designed to serve. 

Within the limitations discussed in the preceding paragraphs, we
shall prefer the more reliable test. There are several factors that
must be taken into account, however, before we can fairly compare
the reliability coefficients of two or more different tests. These will
be discussed in the paragraphs that follow.


1. Range of the Group. The reliability coefficient indicates how
consistently a test places each individual relative to the others in
the group. When there is little shifting from test to retest or form
A to form B, the reliability coefficient is high and vice versa. But
the extent to which individuals will switch places depends on how
closely similar they are. It does not take very accurate testing to
differentiate the reading ability of second graders from that of
seventh graders. But to place each second grader accurately within his
own class is much more demanding.

If children from several different grades are pooled together, we
may expect a much higher reliability coefficient. For example, the
manual for the Otis Quick-Scoring Mental Ability Test-Beta reports
alternate-forms reliabilities for single grade groups ranging from
[illegible] to 0.87. The average value is 0.78. But pooling the complete
range of grades (4-9), the reliability coefficient is reported as 0.96.
These data are all for the same test. They reflect the same precision.
Yet the coefficient for the combined groups is strikingly higher.
Similar data are reported for the Durrell-Sullivan Reading Achievement
Test. The data in this case involve a range of 4 grades, from grade
three through grade six. Reliability coefficients are split-half
reliabilities based on a single testing. In the case of the Word Meaning
Test, the average coefficient for a single grade is 0.93, whereas the
coefficient for the combined grades is [illegible]. For the test of
Paragraph Meaning, the corresponding values are 0.87 and 0.94.

In evaluating the reliability coefficient, the range of ability in
the group tested must be taken into account. If the reliability
coefficient is based upon a combination of age or grade groups, it
must usually be sharply discounted, as can be seen above. But even
in less extreme cases, account must be taken of the variability of
talent within the group. Reliabilities for age groups will tend to
be somewhat higher than for grade groups, because an age group
will usually contain a greater spread of talent than a single grade.
A sample made up of children from a wide range of socio-economic
levels will tend to yield higher reliabilities than a very homogeneous
one. In comparing different tests, one must take account of the
type of sample on which the reliability data were based, in so far
as this can be determined from the reported facts, and judge more
severely the test whose reliability is based on the more heterogeneous
group.
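
The effect of range on the reliability coefficient can be worked out from formula 2 if we are willing to assume that the standard error of measurement is the same in the narrow group and in the wide one. Under that assumption (which is ours, not a claim made by any test manual) formula 2 gives reliability = 1 - s_m²/s_t², and the Python sketch below follows directly. The function names and the standard-deviation ratio of 2.0 are hypothetical; 0.78 and 0.96 are the Otis figures quoted above.

```python
import math


def reliability_in_wider_group(r_narrow, sd_ratio):
    """Reliability expected in a group whose SD is sd_ratio times as large,
    assuming the standard error of measurement is unchanged."""
    return 1 - (1 - r_narrow) / sd_ratio ** 2


def implied_sd_ratio(r_narrow, r_wide):
    """SD ratio (wide / narrow) that would account for two reported reliabilities,
    again assuming an unchanged standard error of measurement."""
    return math.sqrt((1 - r_narrow) / (1 - r_wide))


print(round(reliability_in_wider_group(0.78, 2.0), 3))
print(round(implied_sd_ratio(0.78, 0.96), 2))
```

On this assumption the rise from 0.78 to 0.96 requires no gain in precision at all, only a pooled group a little more than twice as variable as a single grade.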

2. Level of Ability in the Group. Precision of measurement by a 
test may be related to the ability level of the persons being measured. 
However, no simple rule can be formulated for stating the nature of 
this relationship. It depends upon the way in which the particular 
test was built. For those people for whom the test is very hard, so
that they are doing a large amount of guessing, accuracy is likely to
be low. At the other extreme, if a test is very easy for a group, so 
that all of them can do most of the items very easily, it may be 
expected to be ineffective in discriminating among the members of 
the group. When everyone can do the easy items, it is as if we had 
shortened the test to just the few harder items that some can do 
and some cannot. 

It is possible, also, that a test may vary in accuracy at different 
intermediate difficulty levels. The meticulous test constructor will 
report the standard error of measurement for his test at different 
score levels. When separate values of the standard error of measure- 
ment are reported in the manual, they provide a basis for evaluating 
the precision of the test for different types of groups. They permit 
a more appropriate estimate of the accuracy of a particular individ- 
ual's score. Each individual's score can be interpreted in relation 
to the standard error of measurement for scores of that level. For 
example, from data provided by Terman and Merrill [5], the standard
error of measurement for the Revised Stanford-Binet for different I.Q.
levels is found to be as follows:


I.Q. Level          Standard Error of I.Q.
130 and over                5.2
110-129                     4.9
 90-109                     4.5
 70- 89                     3.8
Below 70                    2.3

For this test, the variation that may be expected from one testing
to another is very much higher for children with average and above
average I.Q.’s than for the retarded child. In the case of the Wechsler
Intelligence Scale for Children, the standard error of measurement
depends upon the age of the group tested. The manual reports
values as follows:


 7 1/2-year-olds        4.2 points of I.Q.
10 1/2-year-olds        3.4 points of I.Q.
13 1/2-year-olds        3.7 points of I.Q.


The test is most accurate for an age group in the middle of the age
range for which it was intended.

3. Length of Test. As we saw earlier in discussing the split-half
reliability coefficient, test reliability depends on the length of
the test. If we can assume that the quality of the test items and
the nature of the examinees remain the same, then the relationship
of reliability to length can be expressed by a simple formula. The
formula is

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}} \qquad (3)$$

where $r_{nn}$ is the reliability of a test $n$ times as long as the original test.
      $r_{11}$ is the reliability of the original test.
      $n$ is, as indicated, the factor by which the length of the test is increased.

This is a more general form of formula 1 given earlier.

Suppose we have a spelling test made up of 20 items which has
a reliability of 0.50. We want to know how reliable the test will
be if it is lengthened to contain 100 items comparable to the original
20. The answer is

$$r_{nn} = \frac{5(0.50)}{1 + 4(0.50)} = \frac{2.50}{3.00} = 0.83$$

As the length of the test is increased, the chance errors of
measurement more or less cancel out; score comes to depend more and more
completely upon the characteristics of the person being measured;
and a more accurate appraisal of him is obtained.

Of course, how much we can lengthen a test is limited by a number
of practical considerations. It is limited by the amount of time
available for testing. It is limited by factors of fatigue and boredom
on the part of examinees. It is sometimes limited by the stock of
good test items that it is possible to construct. But within these
limits, reliability can be increased as needed by lengthening the test.
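
Formula 3 can be sketched in Python, together with its algebraic rearrangement for the question "how many times longer must the test be?" The rearrangement, n = r_nn(1 - r_11) / [r_11(1 - r_nn)], is our own solution of formula 3 for n and does not appear in the text; the function names are hypothetical.

```python
def spearman_brown(r_original, n):
    """Formula 3: reliability of a test n times as long as the original."""
    return n * r_original / (1 + (n - 1) * r_original)


def length_factor_needed(r_original, r_target):
    """Formula 3 solved for n: lengthening factor needed to reach r_target."""
    return r_target * (1 - r_original) / (r_original * (1 - r_target))


# The 20-item spelling test with reliability 0.50, lengthened to 100 items.
print(round(spearman_brown(0.50, 5), 2))

# How many times longer must the same test be to reach a reliability of 0.90?
print(round(length_factor_needed(0.50, 0.90), 1))
```

With n = 2 the function reduces to formula 1 for the split-half case.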

One special type of lengthening is represented by increasing the
number of raters who rate an individual or a product he has
produced. If several raters of equal competence or equal familiarity
with the ratee are available, a pooling of their ratings will produce
increased reliability in the composite rating, and this increase will
be described by the same formula we have just been considering.

4. Operations Used for Estimating. How high a value will be
obtained for the reliability coefficient depends also upon which of the
several possible sets of experimental operations was used to estimate
the reliability. We saw in Table 6.4 that the different procedures
treat different sources of variation in different ways, and that it is
only the use of parallel forms of a test with a period intervening
that includes all four sources of variation in "error." That is, this
procedure of estimating reliability represents a more exacting
definition of the test's ability to reproduce the same score. The individual
must then show consistency both from one sample of tasks to another
and from one day to another. We have gathered together a few
examples that show reliability coefficients for the same test when
these were computed by two different procedures. These are shown
in Table 6.6.

The two procedures compared in Table 6.6 are correlation of
alternate forms and correlation of half-tests made up from a single form.
It will be noted that the alternate-forms correlation is lower in every
case. This is consistent with our earlier discussion, in which we
pointed out that the alternate-forms procedure constitutes a more
demanding test of an instrument's precision. The difference between
the two procedures varies from test to test, being as small as
0.04 in one instance and as large as 0.14 in another. But in every
instance, it is necessary to discount the odd-even correlation.


Table 6.6. Comparison of Reliability Coefficients Obtained from Equivalent
Forms and from Fractions of a Single Test

                                                 Alternate      Single
Test                                             Forms          Test
Otis Quick-Scoring Intelligence Test-Beta          .84           .90
Pintner-Durost Intelligence Test
  Scale 1, Picture Content                         .78           .92
  Scale 2, Reading Content                         .92           .97
Essential High School Content Battery
  Mathematics                                      .88           .92
  Science                                      [illegible]       .85
  Social Studies                                   .85           .89
  English                                          .86           .90


HOW HIGH MUST THE RELIABILITY OF A MEASUREMENT BE? 

Obviously, other things being equal, the more reliable our measuring
procedure is, the better satisfied we are with it. A question that
is often raised is: What is the minimum reliability that is acceptable?
Actually, there is no general answer to this question. If we must
make some decision or take some course of action with respect to an
individual, we will do so in terms of the best information we have,
however unreliable it may be, provided only that the reliability is
better than zero. (Of course, here as always the crucial
consideration is the validity of the measure.) The appraisal of any new
procedure must always be in terms of other procedures with which it is
in competition. Thus, a high-school mathematics test with a
reliability coefficient of 0.80 would look relatively unattractive if tests
with reliabilities of 0.85 to 0.90 were already available. On the other
hand, a procedure for judging "leadership" that had a reliability of
no more than 0.60 might look very attractive if the alternative were
a set of uncontrolled ratings having a reliability of 0.45 to 0.50.

Although we cannot set an absolute minimum for the reliability
of a measurement procedure, we can indicate the level of reliability
that is required to enable us to achieve specified levels of accuracy
in describing an individual or a group. Suppose that we have given
a test to two individuals, and that individual A fell at the 75th
percentile of the group while individual B fell at the 50th percentile.
What is the probability that A would still surpass B if they were
tested again? In Table 6.7 the probability of a reversal is shown for
different values of the reliability coefficient. Thus, where the
correlation is 0.00, there is exactly a fifty-fifty chance that the order
of our two individuals will be reversed. When the correlation is 0.50,
the probability of a reversal is 1 in 3. For a correlation of 0.90, there
is still 1 chance in 12 that we will get a reversal on repetition of the
testing. To have 4 chances in 5 that our difference will stay in the
same direction, we require a reliability of about 0.80.

Table 6.7 also shows the situation when we are comparing two
groups of 25. That is, in class A the average fell at the 75th
percentile of some larger reference group, whereas in class B the average
fell at the 50th percentile. We ask what the probability is that we
would get a reversal if the testing were repeated. Here we still have
a fifty-fifty chance when the correlation is 0.00. However, the
security of our conclusion increases much more rapidly as the reliability
of our test is increased. When the reliability is 0.50, the probability
of reversal is already down to 1 in 20; with a correlation of 0.70 it
is only 1 in 1000. Thus, a test with relatively low reliability will
permit us to make useful studies of and draw accurate conclusions
about groups, but relatively high reliability is required if we are to
have precise information about individuals.


Table 6.7. Per Cent of Times Direction of Difference Will Be Reversed in
Subsequent Testing for Scores Falling at 75th and 50th Percentile

                   Per Cent of Reversals with Repeated Test

Reliability     Scores of Single   Means of Groups   Means of Groups
Coefficient     Individuals        of 25             of 100
    .00             50.0               50.0              50.0
    .40             40.3               10.9               0.7
    .50             36.8                4.6               0.04
    .60             32.5                1.2
    .70             27.1                0.1
    .80             19.7
    .90              8.7
    .95              2.2
    .98              0.05
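
The text does not say how the entries of Table 6.7 were computed. The Python sketch below uses an assumed normal-curve model that comes close to the tabled values; it should be read as our reconstruction, not as the authors' procedure. The assumption is that the per cent of reversals equals the normal-curve area below -z, where z = 0.6745 r √N / √(2(1 - r)), with 0.6745 the normal deviate of the 75th percentile, r the reliability, and N the size of each group. The function names are hypothetical.

```python
import math

Z_75TH = 0.6745  # normal deviate corresponding to the 75th percentile


def normal_cdf(x):
    """Area under the unit normal curve below x."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def percent_reversals(reliability, group_size=1):
    """Per cent of retestings expected to reverse a 75th-vs-50th percentile
    difference, under the assumed model described in the lead-in."""
    if reliability >= 1:
        return 0.0
    z = (Z_75TH * reliability * math.sqrt(group_size)
         / math.sqrt(2 * (1 - reliability)))
    return 100 * normal_cdf(-z)


for r in (0.00, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90):
    row = [round(percent_reversals(r, n), 1) for n in (1, 25, 100)]
    print(r, row)
```

Whatever the exact model, the pattern is the one the table shows: averaging over 25 or 100 cases shrinks the chance of a reversal far faster than any practicable gain in the reliability of a single score.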


EFFECTS OF UNRELIABILITY ON CORRELATION BETWEEN VARIABLES 
There is one further effect of unreliability which merits brief atten- 
tion here because it affects our interpretation of the correlations be- 
tween different measures. Let us think of a measure of reading 
comprehension and one of arithmetic reasoning. In each of these
tests, the individual differences in score are due in part to "true" 
ability and in part to chance “errors of measurement.” But if the 
errors of measurement are really chance matters, the reading test 
errors and the arithmetic test errors must be uncorrelated. There is 
no relationship between one toss of a coin and a later toss of a coin. 
So we have these uncorrelated errors in the total score. This means 
that they must water down any correlation that exists between the 
true scores. That is, the actual scores are a combination of true 
score and error, so the correlation between actual scores is a com- 
promise between the correlation of the underlying true scores and 
the 0.00 correlation that characterizes the errors. 
We would like to extract an estimate of the correlation between
the underlying true scores from our obtained data in order to
understand better how much the functions involved have in common.
Fortunately, we can do this quite simply. Such an estimate is provided
by the formula

$$r_{\infty\omega} = \frac{r_{12}}{\sqrt{r_{11}\,r_{22}}}$$

where $r_{\infty\omega}$ is the correlation of the underlying "true" scores.
      $r_{12}$ is the correlation of the obtained scores.
      $r_{11}$ and $r_{22}$ are the reliabilities of the two measures in question.

Thus, if the correlation between our reading test and arithmetic test
is 0.56, and the reliability coefficients of the tests are respectively
0.71 and 0.90, we have

$$r_{\infty\omega} = \frac{0.56}{\sqrt{(0.71)(0.90)}} = 0.70$$

Our estimate is that the correlation between error-free measures of
arithmetic and reading would be 0.70. In thinking of these two
functions, it would be appropriate to think of the correlation as 0.70
rather than 0.56, though the tests correlate only 0.56.

FACTORS MAKING FOR PRACTICALITY IN 
ROUTINE USE 

Though validity and reliability may be all-important in measures
that are to be used for special research purposes, when a test is to
be used in classrooms throughout a school or school system a number
of down-to-earth practical considerations must also be taken into
account. Though it is easy for the administrator to pay undue
attention to small financial savings or to economies of time that make
it possible to fit a test into the standard class period with no shifting
of schedules, these factors of economy and convenience are real
considerations. Furthermore, there are other factors relating to the
readiness with which the tests may be given, scored, and interpreted
that bear more importantly on the use that will be made of the tests
and the soundness of the conclusions that will be drawn from them.


ECONOMY 

The practical significance of dollar savings does not need to be
emphasized. Dollars are of very real significance for any educational
or industrial enterprise. Economy in the case of tests depends in
part on cost per copy. It depends in part on the possibility of using
the test booklets over again. From the junior high school on, and
possibly even in the upper elementary grades, it is feasible to
administer a test using a separate answer sheet. Such a separate
answer sheet permits reuse of the test booklets. If a test will be used
in successive years or if testing can be scheduled so that different
classes or schools will be tested on successive days, an important
economy can be effected by using the same test booklet over again
several times.

A second aspect of economy is saving of time in test administra- 
tion. However, this is often false economy. We saw in the previous 
section that the reliability of a test depends on the length of the test. 
As far as testing time is concerned, we get about what we give. 
Some tests may be a little more efficiently designed, so that they
give a little more reliable measure per minute of testing time, but,
by and large, any reduction in testing time will be accomplished at
the price of loss in the precision or the breadth of our appraisal.

A third, and quite significant, aspect of economy is ease of scoring.
The clerical work of scoring a battery of tests can become either
burdensome if it is done by the already busy teacher or expensive if
it is carried out by clerical help hired for the purpose. A well-
designed test should be planned so as to simplify and speed up the
scoring operation. In tests for young children in the first two or
three grades of school, there is not a great deal that can be done to
streamline scoring procedures. Any attempt to separate the answers
from the problems, so that the answers will be more convenient to
score, is likely to confuse the young child and affect his score. By
the upper elementary grades, however, it is practical to provide
answer spaces at the side of the page, preferably the right, so that
all answers appear in a column and can be scored by placing an
answer key beside them.

The separate answer sheet referred to in an earlier paragraph, and 
also discussed and illustrated in Chapter 4, represents a further 
major economy in time. It completely eliminates time-consuming 
turning of pages. When score is the number correct, the test can be 
scored using a simple stencil with holes punched in the spaces corre- 
sponding to the right answers, which is placed over the answer sheet. 
There are also special types of answer sheets prepared to further 
simplify the scoring operation. Three main types should be noted. 


1. Carbon-Backed Answer Sheets. (Clapp-Young, Scoreze, etc.) 
In these, two sheets are fastened together. On the inside certain 
parts of one or both sheets are covered with carbon. When the 
examinee marks in the answer spaces, the marks are transferred to 
the inside of the page by the carbon paper. The inside has the key 
printed upon it, in the form of boxes or circles placed opposite the 
correct answers. Scoring consists merely of counting the number of 
marks that appear in the boxes. 

2. Pin-Prick Answer Sheets. These operate in essentially the same 
way, except that a pin is pushed through the answer sheet in the 
specified place. This technique is especially effective in the case of 
a multiscore test. It has been used with the Kuder Preference Record, 
where the pin is pushed through several sheets of paper, each one of 
which is printed with the scoring key for a different interest area. 
Counting the number of holes appearing within the printed circles 
on the different sheets gives the score for the different areas of 
interest without the necessity of using key or stencil. 

3. The IBM Answer Sheet and Test-Scoring Machine. For a number 
of years the International Business Machines Corporation has 
made available a test-scoring machine that operates electrically 
through the conductivity of pencil marks on a special answer sheet. 
The sheet has 750 answer positions, which may be grouped in 
different ways but which most commonly represent 150 five-choice test 
items. The answer sheets must be rather carefully marked with a 
soft pencil, preferably a special one developed for the purpose, if they 
are to score accurately. Various other mechanical difficulties have 
been encountered, for example, current leakage due to a damp 
climate. However, when these conditions are watched for, the machine 
can considerably accelerate large-scale test-scoring jobs. The IBM 
machine is available only on a rental basis, and in 1955 the basic 
rental was $40.00 per month. This means that the equipment must 
be used quite a good deal of the time if it is to pay for itself. It is 
especially useful in organizations having a large and fairly steady 
flow of test scoring. 
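
The scoring logic shared by the punched stencil and the multiscore answer sheet described above can be sketched in a few lines. This is only an illustration: the function names, the five-item answer sheet, and the two interest areas are hypothetical, and real keys for a published test would of course come from its manual.

```python
from typing import Dict, Mapping, Sequence, Set, Tuple


def stencil_score(responses: Sequence[str], key: Sequence[str]) -> int:
    """Number-correct score: count marks that show through the key's holes."""
    if len(responses) != len(key):
        raise ValueError("answer sheet and key must cover the same items")
    return sum(1 for marked, correct in zip(responses, key) if marked == correct)


def multi_key_scores(
    responses: Sequence[str],
    keys: Mapping[str, Set[Tuple[int, str]]],
) -> Dict[str, int]:
    """Score one answer sheet against several keys, one key per area.

    Each key is a set of (item index, option) positions, playing the part
    of the printed circles on one sheet of a multiscore answer pad.
    """
    marks = set(enumerate(responses))  # every (item index, option) the examinee marked
    return {area: len(marks & positions) for area, positions in keys.items()}


if __name__ == "__main__":
    answers = ["B", "D", "A", "C", "E"]       # hypothetical examinee
    right_answers = ["B", "D", "C", "C", "A"]  # hypothetical number-correct key
    print(stencil_score(answers, right_answers))

    area_keys = {
        "mechanical": {(0, "B"), (2, "A"), (4, "A")},
        "clerical": {(1, "D"), (3, "C"), (4, "E")},
    }
    print(multi_key_scores(answers, area_keys))
```

The same marks are counted once per key, which is why a single pass of the pin or pencil can yield several scores.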

FEATURES FACILITATING TEST ADMINISTRATION 


In evaluating the practical usability of a test, one factor to be 
taken into account is the ease of administration. A test that can be 
handled adequately by the regular classroom teacher with no more 
than a session or so of special briefing is much more readily fitted 
into a testing program than a test requiring specially trained 
administrators. Several factors contribute to the ease of giving and 
taking a test. 

1. A test is easy to give if it has clear, full instructions. The 
instructions for the administrator should be written out substantially 
word for word, so that all the examiner must do is read them and 
follow them. Instructions for the examinee should also be complete 
and should provide appropriate practice exercises. The amount of 
practice that should be provided depends upon how novel the test 
task is likely to be for those being tested. Where it is a familiar 
type of task or a simple and straightforward instruction, no more 
than a single example will be needed. However, for an unusual item 
format or test task more practice will be desirable. 

2. A test is easy to give if the number of units to be separately 
timed is few, and close timing is not critical. Timing a number of 
brief subtests to a fraction of a minute is a bothersome undertaking, 
and the timing is likely to be inaccurate unless a stop watch is avail- 
able for each tester. Some tests have as many as eight or ten parts, 
each taking only 2 or 3 minutes. A test made up of three or four 
parts, with time limits of 5, 10, or more minutes for each, will be 
easier to use. 

3. The layout of the test items on the page has a good deal to do 
with the ease of taking the test. Items in which response options 
are all run together on the same line, items with small or illegible 
pictures or diagrams, items that are crowded together, and items 
that run over from one page to the next all make difficulty for the 
examinee. Print and pictures should be large and clear. Response 
options should be well separated from one another. All parts of an 
item and all items referring to a single figure, problem, or reading 
passage should appear on the same page or double-page spread. 
Shortcomings on any of these points represent black marks against 
a test as far as ease of taking it is concerned. 


FEATURES FACILITATING INTERPRETATION AND USE OF SCORES 


It seems axiomatic, though the point is sometimes overlooked, 
that a test is given to be used. If the score is to be used, it must be 
interpreted and given meaning. The author and publisher of the 
test have the responsibility of providing the user with information 
that permits him to make a sound appraisal of the test in relation 
to his needs and to give appropriate meaning to the score of an 
individual. This they do primarily through the test manual and 
other collateral materials that are prepared to accompany the test. 
What may the test user reasonably expect to find in the manual for 
a test, together with its supporting materials? We have outlined 
below the aids we believe the test user should expect. 


1. A Statement of the Functions the Test Was Designed to Measure 
and of the General Procedures by Which It Was Developed. This is 
the author's statement of what he considers the test to be valid for 
and the evidence that proper steps have been taken to achieve that 
validity. Particularly for achievement tests, in which we are 
concerned primarily with content and construct validity, the author 
should tell us the procedures by which he arrived at his choice of 
content or his analysis of the functions being measured. If he is 
unwilling to expose his thinking to our critical scrutiny, we may 
perhaps be skeptical of the thoroughness or profundity of that thinking. 

Procedures involve not only the rational procedures by which 
range of content or types of objective were selected, but also the 
empirical procedures by which items were tried out and screened for 
final inclusion in the test. 


2. Detailed Instructions for Administering the Test. We have 
discussed in an earlier section the need for this aid to uniform and easy 
administration by the teachers or others who will have to use the test. 

3. Scoring Keys and Specific Instructions for Scoring the Test. The 
problems of scoring have also been discussed, under the heading of 
economy. The manual and supporting materials should provide 
detailed instructions as to how the score will be computed, how errors 
will be treated, and how part scores will be combined into a total 
score. Scoring keys and stencils should be planned to facilitate as 
much as possible the onerous task of scoring. 

4. Norms for Appropriate Reference Groups, together with 
information as to how they were obtained and instructions for their use. 
Chapter 7 is devoted to a full consideration of types of test norms 
and their use. It will, therefore, be sufficient at this time to point 
out the responsibility of the test producer to develop suitable norms 
for the groups with which his test is to be used. General norms are 
a necessity, and norms suitable for special types of communities, 
special occupational groups, and other more limited subgroups will 
add to the usefulness of a test in many cases. 

5. Evidence as to the Reliability of the Test. This evidence should 
indicate not only the bald reliability statistics but also the 
operations used to obtain the reliability estimates and the descriptive and 
statistical characteristics of each group on which reliability data are 
based. If a test is available in more than one form, it is highly 
desirable that the producers report the correlation between the two 
forms, in addition to any data that were derived from a single 
testing. If the test yields part scores, and particularly if it is proposed 
that any use be made of these part scores, reliability data should be 
reported for the separate part scores. It is good procedure for the 
author to report standard errors of measurement as well as reliability 
coefficients. An author who indicates what the standard error of 
measurement is at each of a number of score levels is particularly to 
be commended, since this information shows over what range of 
scores the test maintains its accuracy. 
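
As a rough illustration of how a reliability coefficient and a standard error of measurement relate, the sketch below uses the classical formula, standard deviation times the square root of one minus the reliability. That formula gives a single over-all value, not the level-by-level values commended above, and the standard deviation of 10 and reliability of 0.91 are hypothetical figures chosen only to keep the arithmetic simple.

```python
import math
from typing import Tuple


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical over-all standard error of measurement for a test."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(score: float, sem: float, width: float = 1.0) -> Tuple[float, float]:
    """Band of plus or minus `width` standard errors around an obtained score."""
    return (score - width * sem, score + width * sem)


if __name__ == "__main__":
    sem = standard_error_of_measurement(sd=10.0, reliability=0.91)  # hypothetical test
    low, high = score_band(score=52.0, sem=sem)
    print(f"standard error of measurement: {sem:.2f}")
    print(f"band around a score of 52: {low:.1f} to {high:.1f}")
```

A manual that reports the standard error separately at several score levels is, in effect, supplying a different value of `sem` for each part of the score range.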

6. Evidence on the Intercorrelations of Subscores. If the test 
provides several subscores, the manual should provide evidence on the 
intercorrelations of these. This is important in guiding the 
interpretation of the subscores and, particularly, in judging how much 
confidence to place in differences between the subscores. If the scores 
are correlated to a substantial degree, measuring much the same 
things, the differences between them will be largely meaningless and 
uninterpretable. 
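
The effect of intercorrelation on differences between subscores can be sketched numerically. We assume here the standard classical formula for the reliability of a difference between two scores expressed in comparable units with equal variability; the reliabilities and correlations below are hypothetical.

```python
def difference_reliability(rel_x: float, rel_y: float, r_xy: float) -> float:
    """Reliability of the difference between two equally variable subscores.

    rel_x, rel_y: reliabilities of the two subscores.
    r_xy: correlation between the two subscores.
    """
    if r_xy >= 1.0:
        raise ValueError("perfectly correlated subscores have no interpretable difference")
    return ((rel_x + rel_y) / 2.0 - r_xy) / (1.0 - r_xy)


if __name__ == "__main__":
    # Two hypothetical subscores, each with reliability 0.90.
    for r in (0.50, 0.80):
        value = difference_reliability(0.90, 0.90, r)
        print(f"intercorrelation {r:.2f}: reliability of difference {value:.2f}")
```

Under these assumptions, two quite reliable subscores that correlate highly with each other yield a difference score of much lower reliability than either subscore alone.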

7. Evidence on the Relationships of the Test to Other Factors. In so 
far as the test is to be used as a predictive device, correlations with 
criterion measures constitute the essential evidence on how well it 
does in fact predict. Full information should be provided on the 
nature of the criterion variables, the group for which data are avail- 
able, and the conditions under which the data were obtained. Only 
then can the reader fairly judge the validity of the test as a predictor. 

It will often be desirable to report correlations with other meas- 
ures of the same function as collateral evidence bearing on the 
validity of the test. Thus, correlations with individual intelligence 
test score are relevant in the case of a group intelligence measure. 

Finally, indications of the effect of age, sex, type of community, 
socio-economic level, and similar facts about the individual or the 
group are often helpful. They provide a basis for judging how 
sensitive the measure is to background of the group and to circumstances 
of their life and education. 

8. Guides for Using the Test and for Interpreting Results Obtained 
with It. The developers of a test presumably know how it is reason- 
able for the test to be used and the results from it to be evaluated. 
They are specialists in that test. For the test to be most useful for 


others, especially the teacher with limited specialized training, sug- 
gestions should be given of ways in which the test results may be 
used for diagnosing individual and group weaknesses, forming class 
groupings, organizing remedial instruction, counseling with the in- 
dividual, or whatever other activities may appear appropriate for 
that particular type of instrument. 


SCHEDULES FOR EVALUATING A TEST 


The potential user, who is trying to select the best test for a 
particular purpose, might welcome a standard form or procedure for 
evaluating the various tests that are candidates for his patronage. 
A standard and somewhat objective procedure for rating tests would 
be very attractive if an appropriate one could be devised. There 
have been several attempts to apply the technique of quantification 
to tests themselves, and score cards have been developed to be used 
in appraising tests (3, 4, 5). These allocate so many points to aspects of 
validity, so many to factors associated with reliability, so many to 
ease of use and interpretation, and so forth. 

One can question how useful this standard scheme of adding up 
points is in this situation. Certainly, if a test has low validity, no 
amount of elegance and polish in other respects can make it a 
satisfactory instrument. And the importance of different qualities for a 
measure varies, depending upon the purpose for which the 
instrument is to be used. For that reason, we are not proposing any 
numerical scheme for arriving at a score on each test being considered. 
However, a systematic outline should help in assuring that the 
significant factors are all taken into account and that the analysis is 
organized in such a way that comparison of different tests is 
facilitated. The schedule given below provides such an outline. If 
answers are sought to all the questions raised in the outline, the 
potential user should have a good basis for comparing the suitability 
for his needs of different available measurement devices. 

An extensive and analytically critical set of criteria for [...] 
psychological test has been developed by the Committee on 
Test Standards of the American Psychological Association, and 
published by the Association (1). This article gives a full statement of the 
standards that a commercially distributed test may be expected to 
meet. Similar standards for educational tests are [...] 
the American Educational Research Association and the National 
Council on Measurements Used in Education. 

SCHEDULE FOR EVALUATING A TEST 


GENERAL REFERENCE INFORMATION 


1. Name of test. 
2. Author's name (and position, if available). 
3. Publisher. 
4. Date of publication. 
5. Cost. 
6. Time for administration. 


VALIDITY 


A. Evidence from the Plan of the Test. What were the procedures for 
determining the scope of the test? For determining the [...] to 
be covered? For determining the functions and processes to be represented? 
How adequate do these appear to be? How closely do the test objectives 
correspond to objectives that you are interested in for your school? 

What provisions were made for editorial review of the test materials? How 
adequate do these appear? 

B. Evidence from the Test Blank Itself. Do the test items appear appropriate 
for the objectives that you are trying to evaluate? Do the test items appear 
to be well constructed? Are they free from ambiguity? Do they have 
attractive wrong-answer choices? 

C. Evidence from Statistical Studies of the Test in Use. With what 
concurrent measures has the test been correlated? For what sort of groups? How 
substantial are the correlations? 

With what later criterion measures has the test been correlated? For what 
sorts of groups? 

How does the evidence on statistical validity compare with that for other 
tests? 

How accurate a prediction does it give of significant outside criteria? How 
do these results compare with those of other tests that try to measure the 
same trait? 

D. Evidence from Outside Authority. What have reviewers and critics said 
about the validity of the test? 


RELIABILITY 


A. How Adequately Are Data Reported? Do the authors indicate size and 
nature of groups for which data are reported? Do they indicate type of re- 
liability coefficient computed? Do they give mean and standard deviation 
for the groups? Do they report reliabilities for single age and grade groups? 

B. What Are the Facts on Reliability? What actual data on reliability are 
reported? (Indicate, as far as given, the age or grade, size of group, mean 
and standard deviation, procedures by which reliability was computed, and 
resulting values obtained.) How do the data compare with other competing 
tests? 


PRACTICAL CONSIDERATIONS IN ADMINISTRATION AND USE OF TEST 
A. Factors in Administration 
1. Adequacy of manual. 
2. Complexity of procedures. 
a. Complexity of process required of students. 
b. Adequacy of instructions and practice exercises. 
c. Complexity of process required of examiner. Timing, giving in- 
structions, and interpreting responses of subjects examined. 
3. Time requirements. 
4. Legibility, attractiveness, and convenience of format. 


B. Factors in Scoring 


1. Time required (i.e. form of answer, type of key, etc.). 
2. Special skills required (subjective scoring and qualitative interpre- 
tation). 


C. Factors in Interpretation 
1. Type of norms. Appropriateness to uses, completeness, 
representativeness of sample. How readily may raw scores be converted into 
derived scores? 
2. Aids to interpretation provided by manual. 


D. Factors in Continued Use 
1. Are there comparable forms? How many? How well is compara- 
bility established? 


2. Cost. Does this permit routine continued use? 


SUMMARY STATEMENT 


We have discussed the requirements of a good test under the 
headings of validity, reliability, and practicality. A test is valid in so far 
as it measures the qualities we wish to measure. It is reliable in so 
far as it measures with precision. It is practical in so far as it is 
economical of time and money and simple to give and interpret. 

The crucial requirement for a test is validity. Sometimes we may 
have to judge the validity of a test by rational analysis. We may 
examine how well its content corresponds to the objectives of our 
instruction, or we may analyze the functions or processes it measures 
to see how well they correspond to the concept or construct that we 
have set out to appraise. Sometimes we may gather statistical 
evidence of validity. The statistical evidence will usually be in the form 
of correlations with other measures. The correlation may be with a 
concurrent measure of the same sort. The correlation may indicate 
prediction of some later criterion score. 

There are several different procedures available for obtaining 
estimates of the reliability or precision of a measure. The most rigorous 
procedure is to administer two equivalent forms of the test on two 
separate occasions. The correlation between the two forms provides 
a reliability coefficient that tells how closely individuals maintain 
their position in the group from one testing to the other. Less 
exacting procedures include (1) repetition of the same test and (2) 
extracting two scores from a single test, usually by scoring odd and even 
items separately. Reliability estimates based on these last 
procedures are less satisfactory and should usually be discounted 
somewhat. 

The value obtained for the reliability coefficient will depend on 
the range and level of ability in the group tested and the length of 
the test, as well as upon the particular procedure used for 
estimation. It is particularly necessary to discount a coefficient based on 
the pooling of several grades. 
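
The odd-even procedure mentioned above can be sketched as follows. The sketch assumes items scored 1 (right) or 0 (wrong), correlates the two half-test scores, and then steps the half-test correlation up to full length with the Spearman-Brown formula, which is the conventional correction and is our assumption here. The four-pupil data set is hypothetical and far too small for real use.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(var_x * var_y)


def odd_even_reliability(item_scores: List[List[int]]) -> float:
    """Split-half estimate: correlate odd and even half scores, then step up."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_half, even_half)
    return 2.0 * r_half / (1.0 + r_half)                 # Spearman-Brown step-up


if __name__ == "__main__":
    pupils = [            # hypothetical right/wrong records, one row per pupil
        [1, 1, 1, 0, 1, 0],
        [1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 1, 0, 0, 0],
    ]
    print(f"odd-even reliability estimate: {odd_even_reliability(pupils):.2f}")
```

Because both halves are taken at one sitting, day-to-day variation in the pupil does not lower this estimate, which is one reason such coefficients should be discounted somewhat.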

To describe the accuracy of an individual's score, the standard 
error of measurement is often preferable to the reliability coefficient. 
It tells the variation to be expected if we were able to make repeated 
measurements of a particular individual. This variation must always 
be borne in mind when interpreting the score an individual receives. 

Practicality is a function of economy, ease of administration, and 
readiness of interpretation. Economy is affected by initial cost, by 
the possibility of reusing materials, and by time required for scoring 
and analyzing the results. Ease of administration results from full 
directions, simple procedures for the examinee, and an objective rec- 
ord of performance. Readiness of interpretation is facilitated by 
good norms and by a full guide of suggestions for interpretation. 

The potential test user should examine the tests from among which 
he must choose in the light of the above criteria and pick the one that 
best fits his needs. 


REFERENCES 


1. American Psychological Association, Committee on Test Standards, 
Technical recommendations for psychological tests and diagnostic techniques, 
Psychol. Bull., 1954, 51, No. 2, Pt. 2. 

2. Buhler, R. X., Flicker fusion threshold and anxiety level, unpublished 
doctor's dissertation, Columbia University, 1953. 

3. Cole, R. D., and F. von Borgersrode, A scale for rating standardized 
tests, Sch. of Educ. Rec. of Univ. of North Dakota, 1928 (Oct.), 14, 11-15. 

4. Otis, A. S., Scale for rating tests, Yonkers, N. Y., World Book, 1926. 

5. Rinsland, H. D., Form for briefing and evaluating standardized tests, 
J. educ. Res., 1949, 42, 371-375. 

6. Terman, L. M., and Maude A. Merrill, Measuring intelligence, 
Cambridge, Mass., Houghton Mifflin, 1937. 


SUGGESTED ADDITIONAL READING 


American Psychological Association, Technical recommendations for psycho- 
logical tests and diagnostic techniques, Washington, D. C., American Psycho- 
logical Association, 1954. 

Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chapters 
5 and 6. 

Bennett, George K., Harold C. Seashore, and Alexander G. Wesman, 
Differential aptitude tests manual, New York, Psychological Corporation, 1952, 
Chapters 4 and 5. 

Cureton, Edward E., Validity, Chapter 16 in E. F. Lindquist, editor, 
Educational measurement, Washington, D. C., American Council on Education, 
1951. 

Thorndike, Robert L., Reliability, Chapter 15 in E. F. Lindquist, editor, 
Educational measurement, Washington, D. C., American Council on 
Education, 1951. 

Tiffin, Joseph, Industrial psychology, 2nd ed., New York, Prentice-Hall, 
1947, pp. 46-80. 

Wesman, А. G., Expectancy tables—a way of interpreting test validity, 
Test service bulletin, Psychological Corporation, 1949, No. 38, 1-5. 


QUESTIONS FOR DISCUSSION 


1. If the College Entrance Examination Board were developing a general 
survey test in science for high-school seniors, what might they do to estab- 
lish the validity of the test? 

2. What type of validity is indicated by each of the following statements 
which might be found in a test manual? 

a. Scores on Personality Test X correlated +0.43 with teachers' ratings of 
adjustment. 

b. The objectives to be appraised by Reading Test Y were rated for 
importance by 150 classroom teachers. 

c. Scores on Clerical Aptitude Test Z correlated +0.57 with supervisors' 
ratings after 6 months on the job. 

d. Intelligence Test W gives scores that correlate +0.69 with 
Stanford-Binet I.Q. 

e. Achievement Battery V is based on an analysis of 50 widely used texts 
and 100 courses of study from all parts of the U. S. 

3. Comment on the statement "The classroom teacher is the only one who 
can judge the validity of an achievement test for his class." 

4. Look at the manuals of two or three tests of different types. What 
evidence on validity is presented? How adequate is it for each test? 

5. Using Table 6.3 on p. 123, determine what per cent of those selected 
would be above average on the job if a selection procedure with a validity of 
0.40 were used and only the top quarter were accepted for the job. What per 
cent would be above average if the top three-quarters were selected? What 
would the two per cents be if the validity were 0.50? What does a 
comparison of the four percentages bring out? 

6. Air Force personnel psychologists are doing research on the selection of 
aviation mechanics. What might they use as criterion measures of success as a 
mechanic? What are the advantages and limitations of each possible measure? 

7. What advantages and disadvantages do school grades have as criterion 
measures? 

8. Look at the evidence presented on reliability in the manuals of two or 
three tests. How adequate is it? What are its shortcomings? 

9. The manual for test T presents reliability data based on (a) retesting 
with the same test form a week later, (b) correlating odd with even items, and 
(c) correlating form A with form B, the two forms being given a week apart. 
Which procedure may be expected to yield the lowest coefficient? Why? 
Which to yield the most useful estimate of reliability? Why? 

10. A student has been given the Stanford-Binet Intelligence Test four 
different times during his school career, and his cumulative record card shows 
the following I.Q.'s: 98, 107, 101, and 95. What significance should be attached 
to the fluctuations in I.Q.? 

11. A school plans to give form A of a reading test in October and form B 
in May, so as to measure pupil growth in reading during the year. Both 
forms have a standard error of measurement of 6 points, and the average 
gain during the year was 15 points. What significance does the standard 
error of measurement have when the teacher starts to interpret the gains for 
individual pupils? 

12. You are considering three reading tests for use in your school. As far 
as you can judge, the three are equally valid. The reliability of each is 
reported to be 0.90. What else would you need to know to make a choice 
among the tests? 

13. Examine several tests of intelligence or of achievement that would be 
suitable for a class you are teaching or might teach. Write an evaluation of 
one of these tests, following the guide on pp. 147-149. 


Chapter 7 


Norms and Units for 
Measurement 


THE NATURE OF A SCORE 


Johnny got a score of 15 on his spelling test. What does that 
mean, and how should we interpret it? 

Actually, as it stands it has no meaning at all and is completely 
uninterpretable. At the most superficial level, we don't even know 
whether this represents a perfect score, i.e., 15 out of 15, or a very 
low per cent of the possible, i.e., 15 out of 50. But even supposing 
we do know that it is 15 out of 20, or 75 per cent, what then? 

Look at Table 7.1. This shows two 20-word spelling tests. A 
score of 15 would have vastly different meaning if it were on test A 


Table 7.1. Two 20-Word Spelling Tests 


Test A        Test B 

bar           baroque 
cat           catarrh 
form          formaldehyde 
jar           jardiniere 
nap           naphtha 
dish          discernible 
fat           fatiguing 
sack          sacrilegious 
rich          ricochet 
sit           citrus 
feet          feasible 
act           accommodation 
rate          inaugurate 
inch          insignia 
rent          deterrent 
lip           eucalyptus 
air           questionnaire 
rim           rhythm 
must          ignoramus 
red           accrued 


than on test B. A person who got only 15 right on test A would 
not be outstanding in a second- or third-grade class. Try test B out 
on some friends or classmates. You will probably not find many of 
them who can spell 15 of these words correctly. When this test 
was given to a class of graduate students, only 22 per cent of them 
spelled 15 of the words correctly. A score of 15 on test B is a good 
score among graduate students of education. 

As it stands, then, a score of 15 words right, or even of 75 per 
cent of the words right, can have no meaning or significance. It 
gets meaning only as we have some standard with which to 
compare it. 

In the usual classroom test, the standard operates indirectly and 
imperfectly, partly through the teacher's choice of tasks to make up 
the test and partly through his standards for evaluating the 
responses. Thus, the teacher picks tasks to make up the test that he 
considers to be appropriate to represent the learnings of his group. 
No teacher in his right mind would give test A to a high-school 
group or test B to third graders. Where the responses vary in 
quality, as in essay examinations, the teacher sets a standard for 
grading that corresponds to what he considers it reasonable to expect 
from a group like his. Quite different answers to the question "What 
were the causes of Hitler's rise to power?" would be expected from a 
ninth grader and from a college history major. 

However, the inner standard of the individual teacher is very sub- 
jective, inaccurate, and unstable. Furthermore, it provides no basis 
for comparing different classes or different areas of ability. We have 
no answers to such questions as: Are the children in school A better 
in reading than those in school B? Is Mary better in reading than 
in arithmetic? Is Johnny doing as well in algebra as we should ex- 
pect? We need some broader, more uniform, objective and stable 
standard of reference if we are to interpret psychological and educa- 
tional measurements. 

Let us take another look at our tests A and B. Suppose, now, 
that we were to combine them into a single 40-word test and to give 
that test to 20 pupils in each grade from second through twelfth. 
What would we find? We would soon see that above the second or 
third grade almost everybody would get the first 20 words right. 
But until we got well up the grade ladder, children would get very 
few of the second set. It doesn't take much gain in spelling ability 
to improve from a score of 10 to one of 20 on this particular test, but 
to improve up to a score of 30 represents quite a respectable 
accomplishment. The two 10-point gains don't begin to be equal. The 
units on our scale of scores cannot be considered equal units, then. 
We have a rubber yardstick that has been stretched out at some 
points and squeezed in at others. 

There is one further point that we should make about our spelling 
scores. Let us consider test B, since the point will be most clearly 
and obviously true in this case. A person who fails to get any of 
the items right on test B cannot be said to fall at an absolute zero 
of spelling ability. Actually, he may be able to spell hundreds, 
possibly thousands, of words. So a person who gets 10 words right on 
test B doesn't demonstrate twice as much spelling ability as a 
person who gets only 5 right. On this test, as in an iceberg, the great 
bulk of what we are examining lies below the surface and can't be 
seen. We cannot guarantee that even test A gets down to a true 
zero point. In fact, it would be hard to say what a real zero point 
is in spelling ability. 


THE NEED FOR NORMS 


We must look, then, for some better type of unit in which to 
express test results than a raw count of units of score or a crude 
percentage of possible score. We would like the units to have these 
properties: 

1. Uniform meaning from test to test, so that a basis of comparison 
is provided through which we may compare different tests, e.g., 
different reading tests, a reading test with an arithmetic test, or an 
achievement test with a scholastic aptitude test. 

2. Units of uniform size, so that a gain of 10 points on one part of 
the scale signifies the same thing as a gain of 10 points on any other 
part of the scale. 

3. A true zero point of "just none of" the quality in question, so 
that we can legitimately think of "twice as much as" or "two-thirds 
as much as." 

The different types of norms that have been developed for tests 
represent marked progress toward the first two of the above 
objectives. The third can probably never be reached for the traits with 
which psychological and educational measurement is concerned. We 
can put five 1-pound prints of butter on one side of a pair of scales, 
and they will balance the contents of a 5-pound bag of flour poured 
into the other. "No weight" is truly "no weight," and units of 
weight can be added together. But we don't have that type of zero 
point or that way of adding together in the case of educational and 
psychological measurement. If you put together two morons, you 
will not get a genius, and a pair of bad spellers will not win a 
spelling bee. 

Basically, a raw point score can be given meaning only by refer- 
ring it to some type of group or groups. A score is not high or low, 
good or bad; it is higher or lower, better or worse. There are two 
general ways that we may relate a person's score to a more general 
framework. One way is to compare him with a graded series of 
groups and see which one he matches. Each group in the series 
usually represents a particular school grade or a particular chrono- 
logical age. The other way is to find where he falls in a particular 
group, in terms of the per cent of the group he surpasses or in terms 
of the group's mean and standard deviation. Thus, we find four 
main patterns for interpreting the score of an individual. These are 
shown schematically in Table 7.2. We shall consider each in turn, 
evaluating its advantages and disadvantages. 


Table 7.2. Main Types of Norms for Educational and Psychological Tests 


Type of Norm        Type of Comparison                 Type of Group

Age norms           Individual matched to group he     Successive age groups.
                    equals.

Grade norms         Same as above.                     Successive grade groups.

Percentile norms    Per cent of group surpassed by     Single age or grade group
                    individual.                        to which individual be-
                                                       longs.

Standard score      Number of standard deviations      Same as above.
norms               individual falls above or below
                    average of group.


AGE NORMS 


For any trait that shows a progressive change with age, we can 
prepare a set of age norms. The norm for any age, in this sense, is 
the average value of the trait for persons of that particular age. Let 
us take the example of height. If we get a representative sample of 
8-year-old girls, measure the height of each, and get the average of
those measures, we determine the norm for height for that age group. 
Note that in this case the norm is nothing more than the average 
value. It is not the ideal value. Nor is it the value to be expected 
of each person. It is simply the average value. It will pay to re- 
member this in thinking about age and grade norms. 

The average height can be determined in the same way for 9-year-
olds, 10-year-olds, and each other age group. The values will fall on
some such curve as that shown in Fig. 7.1. Points for the curve
will ordinarily be computed only for full-year groups, but the curve 
is to be considered continuous. That is, we can estimate points in 
between the year groups by referring to the continuous curve. Thus,
in Fig. 7.1 the height 55 inches corresponds to (or is average for) the
age 10.

[Figure: height in inches (vertical axis) plotted against age in years (horizontal axis).]
Fig. 7.1. Girls' age norms for height. (Adapted from Boynton.¹)

We can refer any height measurement to this scale and find for 
what age it would be average. Any girl's height can be interpreted 
as being the average height for a girl of a particular age. Thus, the
girl who has a height of 60 inches can be described as being as tall
as the average girl of 12 years. If we also know how old the girl
actually is, we can judge whether she is tall, average, or short for
her age. Thus, if Mary is 55 inches tall and is only 8 years old, we
know that she is tall for her age. Her height is average for a 10-
year-old.
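
Reading an age equivalent off the norm curve can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `age_equivalent` and the straight-line interpolation between yearly points are ours, and of the norm values below only 55 inches at age 10 and 60 inches at age 12 come from the discussion above. The 8-year value is invented for the example.

```python
def age_equivalent(value, norms):
    """Return the age at which `value` would be the average.

    `norms` maps age in years to the average value at that age.
    Points between yearly norms are estimated by straight-line
    interpolation, standing in for the continuous curve.
    """
    points = sorted(norms.items())
    for (age_lo, v_lo), (age_hi, v_hi) in zip(points, points[1:]):
        if v_lo <= value <= v_hi:
            return age_lo + (age_hi - age_lo) * (value - v_lo) / (v_hi - v_lo)
    raise ValueError("value lies outside the range covered by the norms")


# Hypothetical norms: the age-8 figure is invented; ages 10 and 12 follow the text.
HEIGHT_NORMS = {8: 50.0, 10: 55.0, 12: 60.0}

# Mary is 55 inches tall; her height-age is that of the average 10-year-old.
mary_height_age = age_equivalent(55.0, HEIGHT_NORMS)
print(mary_height_age)
```

Comparing the result with Mary's actual age of 8 is what tells us she is tall for her age; the age equivalent alone does not.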

The age framework is a relatively simple and familiar one. "He
is as big as a 12-year-old" is a common way of describing a youngster.
For a trait that shows continuous and relatively steady growth over
a period of years, the age framework is a convenient one. Its fa-
miliarity and convenience are its major advantages. Age norms
have a number of disadvantages, which we must now consider in
more detail.

The big issue in using age norms is whether we can reasonably 
think of a year's growth as representing a standard and uniform unit. 
Is the growth from age 5 to age 6 equal to the growth from age 15 
to age 16, and similarly for each age on our scale? As we push up
the age scale, we soon reach a point where we see that the year's
growth unit is clearly inappropriate. There comes a point, some
time in the teens or early twenties, when growth in almost any trait
we can measure ceases. Man no longer gets taller or stronger, or
scores higher on tests of mental ability. As this age is approached
growth gradually slows down. A year's growth in this region seems
by any sensible interpretation to be much less than a year's growth
earlier on the scale. The failure of the unit "one year's growth" to
have uniform meaning is most apparent as one considers the ex- 
tremes of age, but there is no guarantee that this unit has uniform 
meaning even in the intermediate range. 

The problem introduced by the flattening growth curve is most
apparent when we consider the individual who falls far above the 
average. What is the height-age of the youth who is 6 feet 4? The 
average individual never reaches that level. What mental age shall 
we assign to the Phi Beta Kappa who can handle intellectual tasks 
that the average individual can never do at any age? Any assign- 
ment of age values is an arbitrary and artificial one for cases like 
these and has no relationship to a real age framework. 

It is also true that growth curves are not entirely comparable for 
different functions. Rate of growth and time of reaching a maxi-
mum differ substantially. How shall we compare age scores on a
vocabulary test and a maze-tracing test, for example, if the first 
continues to rise up to and into the twenties, while the second 
reaches a maximum in the early teens? For a 10-year-old to have 
reached the 12-year-old level may represent appreciably different 
degrees of superiority for different traits. 

Two years’ acceleration may also have quite different meaning, 
depending on the age level at which it occurs. A 5-year-old who is
as tall as the 7-year norm is much more outstanding than the 10-
year-old who reaches the 12-year norm. This fact has led to the
development of the intelligence quotient and other types of quotients 
(which we shall consider presently) to allow for age differentials in 
the examinee. But the basic problem of equality of the age unit 
throughout the age scale still remains. 

Of course, age norms are primarily appropriate for traits that de- 
pend on general normal growth. A trait showing no continuous im- 
provement over the age range (such as acuity of vision) cannot pos- 
sibly be expressed in terms of a scale of age units. One that de- 
pends primarily upon specific educational experiences, such as facility 
in arithmetical operations, seems to be more reasonably related to 
the educational framework of school grades than to the biological 
framework of years of growth. 

Finally, though it does not directly concern the consumer of tests, 
it is worth noting that from the viewpoint of the test producer age 
norms present some serious practical problems. It is often difficult 
to get together a truly representative sample of individuals of a 
given age. Thus, if one wanted a cross-section of 12-year-olds one 
would have to look for some of them in the elementary school and 
some in the junior high school. They must at least be assembled 
from quite a range of school grades. Then as one moves toward the
older ages the sample one needs to reach is widely scattered: some
in school, some at college, some in the military establishment, and
some in the world of work. To reach a representative sample of
18-year-olds, for example, is a very forbidding task. This is one
more reason why the usual age norms for tests become suspect as
one moves up into the teens.

In summary, age norms, which are based on the performance of
the average person at each age level, provide a readily comprehended
framework for interpreting the performance of a particular individual.
However, the equality of the age units is open to serious question.
As one goes up to adolescence and adulthood, age ceases to have
any meaning as a unit in terms of which to express level of per-
formance. Age norms are most appropriate for the elementary-
school years and for abilities that grow as a part of the general
development of the individual. Physical and physiological charac-
teristics such as height, weight, and dentition, and psychological
traits such as general intelligence appear to be ones for which this
type of norm is most acceptable.




GRADE NORMS 


Grade norms have many of the characteristics of age norms, dif- 
fering only in that the reference groups are grade groups instead of 
age groups. That is, a test is given to representative groups in each
of a series of school grades, and the average score is determined for 
each grade. Scores lying between the norm for two successive grades 
are assigned fractional credits by interpolation. The standard ter- 
minology assigns the value 5.0 to average performance at the begin- 
ning of the fifth grade, 5.5 to average performance at the middle of 
the grade, and so forth. A representative table of grade norms for 
the reading test of the Metropolitan Achievement Test Battery is shown 
in Table 7.3. Thus, in this table a raw score of 26 corresponds to 


Table 7.3. Grade Equivalents of Raw Scores for Reading Test of
Metropolitan Achievement Test, Intermediate Level


Raw     Grade      Raw     Grade      Raw     Grade
Score   Equiv.     Score   Equiv.     Score   Equiv.
 60     11.2+       40      6.3        20      4.6
 59     10.6        39      6.2        19      4.6
 58     10.1        38      6.1        18      4.5
 57      9.7        37      6.0        17      4.4
 56      9.3        36      5.9        16      4.3
 55      9.0        35      5.8        15      4.2
 54      8.7        34      5.7        14      4.1
 53      8.4        33      5.6        13      4.1
 52      8.2        32      5.5        12      4.0
 51      8.0        31      5.4        11      3.9
 50      7.8        30      5.3        10      3.8
 49      7.6        29      5.2         9      3.7
 48      7.4        28      5.2         8      3.6
 47      7.2        27      5.1         7      3.6
 46      7.1        26      5.0         6      3.5
 45      7.0        25      4.9         5      3.4
 44      6.8        24      4.9         4      3.3
 43      6.7        23      4.8         3      3.2
 42      6.6        22      4.7         2      3.0
 41      6.5        21      4.7         1      3.0-


Copyright 1946 by World Book Company. Reproduced by permission of
the World Book Co.


the performance of the average child at the beginning of the fifth 
grade, a raw score of 37 is average for beginning sixth grade, while 
32 is average for the middle of grade five. 
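
The interpolation that assigns fractional grade credits can be illustrated with a short Python sketch. It assumes a straight line between two successive grade norms, which is a simplification of however the publisher actually smoothed the table. The function name `grade_equivalent` is ours; the two anchor points (raw score 26 at grade 5.0, raw score 37 at grade 6.0) are the ones read from Table 7.3 above.

```python
def grade_equivalent(raw, lower, upper):
    """Interpolate a grade equivalent between two successive grade norms.

    `lower` and `upper` are (grade, average raw score) pairs. The result
    is rounded to tenths of a grade, as in the published tables.
    """
    (grade_lo, score_lo), (grade_hi, score_hi) = lower, upper
    fraction = (raw - score_lo) / (score_hi - score_lo)
    return round(grade_lo + (grade_hi - grade_lo) * fraction, 1)


# Norms for beginning fifth and beginning sixth grade, as read from Table 7.3.
print(grade_equivalent(32, (5.0, 26), (6.0, 37)))  # 5.5
```

A raw score of 32 lies 6 of the 11 score points from one norm to the next, which rounds to the 5.5 reported for the middle of grade five.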




Grade norms have much the same limitations as age norms. The 
equality of the grade units is quite suspect. Especially in those areas 
in which definite instruction is sporadic or is stopped at a relatively 
early grade, one questions the equality of gains from grade to grade.
In some arithmetical or spelling skills, for example, gains may be
negligible during the high-school years.

The slowing down of gains at the upper grade levels makes it very 
difficult to express the performance of a very able child in terms of
this framework. Many a superior child in the seventh or eighth
grade can only be designated 11+ in terms of grade norms for
standard school subjects. That is, his performance surpasses that
of the average child in the highest grade for which norms are mean-
ingful.

A further caution must be introduced with respect to the inter-
pretation of grade norms. Consider a bright and educationally ad-
vanced child in the third grade. Suppose we find that on a stand-
ardized arithmetic test he gets a score for which the grade equivalent
is 5.9. This does not mean that our child has a mastery of the arith-
metic taught in the fifth grade. He got a score as high as that gotten
by the average child at the end of the fifth grade, but this higher
score was almost certainly obtained in part by superior mastery of
third-grade work. The average child is sufficiently slow and inac-
curate that a number of score points (and consequently a higher
grade equivalent) can be earned merely by real mastery at his own
grade level. This is worth remembering. The fact that our child
has a grade equivalent of 5.9 need not mean that the child is ready
to move ahead into sixth grade work. It is only the reflection of a
score and does not tell in what way that score was attained.

Grade norms are relatively easy to determine, since they are based
on the administrative groups already established in the school or-
ganization. In the directly academic areas of achievement, the con-
cept of grade level is perhaps a more meaningful one than age level.
It is in relation to his grade placement that a child's performance in
these areas is likely to be used and interpreted. Outside of the
school setting, grade norms have little meaning.

To summarize, grade norms, which relate the performance of an
individual to that of the average child at each grade level, are useful
primarily in providing a framework for interpreting the academic
accomplishment of children in the elementary school. For this pur-
pose they are relatively convenient and meaningful, even though we
cannot place great confidence in the equality of grade units. They
have little value for other types of groups or measures.




PERCENTILE NORMS 


We have just seen that in the case of age and grade norms we give 
meaning to an individual's score by determining the age or grade 
group in which he would be just average. But it will often make 
more sense to compare him to his own age or grade group—to a 
group of which he may legitimately be considered a member. This 
is the type of comparison we make when we use percentile norms. 

We saw in Chapter 5 how we could compute for any set of scores 
the median, quartiles, and any percentile. For each score value, we
can compute the per cent of cases, p, falling below that score. Any
person getting that score then surpasses p per cent of the group on
which the percentile values were computed. We will say that he
falls at the pth percentile, or has a percentile rank of p. 
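
As a minimal sketch of this computation, the Python below counts the per cent of a norm group falling below a given score. The function name `percentile_rank` and the ten scores are hypothetical, chosen only to make the arithmetic easy to follow. Ties are ignored here, which published norm tables may handle differently.

```python
def percentile_rank(score, group_scores):
    """Per cent of the norm group scoring strictly below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


# Hypothetical norm group of ten pupils.
group = [3, 5, 7, 8, 9, 10, 12, 13, 15, 18]

# Six of the ten scores fall below 12, so a score of 12 surpasses 60 per cent.
print(percentile_rank(12, group))  # 60.0
```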

Table 7.4 shows percentile norms for the eight subtests of the 
Differential Aptitude Test Battery. Look at the column headed Verb.
Reas. (Verbal Reasoning). The entries are scores. Thus, a score of 
24 corresponds to the 75th percentile. An individual who gets this 
score surpasses 75 per cent of the group on which the norms were 
based. A score of 17 corresponds to the 50th percentile on this test. 
On the Abstract Reasoning Test (Abs. Reas.), a score of 26 corresponds 
to the 50th percentile. This score represents the same degree of ex- 
cellence as the score of 17 on the Verbal Reasoning Test. 

Note that not every percentile is given in Table 7.4. For most of 
the range, the percentiles are given by steps of 5, and sometimes 
several score points correspond to the particular percentile value.
If more detailed tables were given, these scores would correspond to 
different percentiles. However, locating an individual to the nearest 
5 percentiles is close enough for all practical purposes. (Remember 
the standard error of measurement.) 

Percentile norms are very widely adaptable and applicable. They 
can be used wherever an appropriate normative group can be ob-
tained to serve as a yardstick. They are appropriate for young or
old, educational or industrial situations. To surpass 90 per cent of 
the reference comparison group signifies a comparable degree of ex- 
cellence whether the function being measured is how rapidly one can 
solve simultaneous equations or how far one can spit. Percentile 
norms are widely used. Were it not for two points we must now 
consider, they would provide a very nearly ideal framework for inter- 
preting test scores. 

The first problem that faces us in the case of percentile norms is 
that of the norming group. On what type of group should the norms
be based? Clearly, we will need different norm groups for different
ages and grades in our population. A 9-year-old must be evaluated
in terms of 9-year-old norms; a sixth grader, in terms of sixth-grade


Table 7.4. Percentile Norms for Differential Aptitude Tests 
BOYS                           Raw Scores (N = 6900)*
GRADE

Percentile  Verb.   Num.    Abs.    Space   Mech.   Clerical  LU-I:   LU-II:  Percentile
            Reas.   Abil.   Reas.   Rel.    Reas.   S and A   Spell.  Sent.

   99       41+     35+     44+     87+     60+     73+       90+     59+        99
   97       36-40   32-34   41-43   81-86   56-59   66-72     80-89   52-58      97
   95       33-35   30-31   39-40   75-80   53-55   62-65     72-79   47-51      95
   90       30-32   27-29   37-38   69-74   50-52   59-61     63-71   42-46      90
   85       27-29   25-26   35-36   64-68   48-49   57-58     56-62   38-41      85
   80       25-26   23-24   34      60-63   46-47   55-56     51-55   35-37      80
   75       24      22      32-33   56-59   44-45   53-54     47-50   33-34      75
   70       22-23   21      31      53-55   42-43   52        42-46   31-32      70
   65       21      19-20   30      49-52   41      51        38-41   29-30      65
   60       19-20   18      29      45-48   39-40   50        34-37   27-28      60
   55       18      17      27-28   41-44   37-38   48-49     31-33   25-26      55
   50       17      16      26      37-40   35-36   47        26-30   22-24      50
   45       16      15      24-25   33-36   34      46        23-25   20-21      45
   40       15      14      23      29-32   32-33   44-45     20-22   18-19      40
   35       14      12-13   21-22   25-28   30-31   43        16-19   16-17      35
   30       13      11      19-20   21-24   28-29   42        13-15   14-15      30
   25       12      10      16-18   17-20   26-27   40-41     9-12    12-13      25
   20       10-11   9       13-15   14-16   23-25   38-39     6-8     9-11       20
   15       9       7-8     9-12    11-13   20-22   36-37     2-5     6-8        15
   10       7-8     5-6     4-8     7-10    16-19   33-35     1       2-5        10
5 6 3-4 1-3 3-6 3 — 1 5 
3 ie si 8 1-2 ° 0 3 
1 0-3 0 = 0 == = 1 
Mean i& dé Deal Mean 
SD 8.7 8.2 11.3 23.4 12.6 10.5 24.1 14.6 SD 


Copyright 1952 by the Psychological Corporation. Reproduced by permission
of the Psychological Corporation.

norms; an applicant for a job as stock clerk, in terms of stock-clerk-
applicant norms. The appropriate norm group is in every case the
group to which the individual belongs and in terms of which his
status is to be evaluated. It makes no sense to compare a medical-
school applicant with norms based on unselected adults.

If we are to use percentile norms, then, we must have multiple
sets of norms. We must have norms appropriate for each distinct
type of group or situation with which our test is to be used. This is
recognized by the better test publishers, who provide norms not only
for different age or grade groups but also for special types of educa-
tional or occupational populations. However, there are limits to the
number of distinct groups for which a test publisher can produce
norms.

Published percentile norms will often need to be supplemented by 
the test user, who can build up norm groups particularly suited to 
his individual needs. Thus, a given school system will often find it 
valuable to develop local percentile norms for its own pupils. This 
will permit interpretation of individual scores in terms of the local
group, a comparison that may be more significant for local prob-
lems than comparison with the national norms. Again, an employer
who uses a test with a particular category of job applicants may 
well find it useful to prepare norms for this particular group of 
people. Evaluating a new applicant will be much facilitated by 
these strictly local norms. 

The second problem in relation to percentile norms is more serious. 
Again, we are faced by the problem of equality of units. Can we
think of 5 percentile points as representing the same amount through-
out the percentile scale? Is the difference between the 50th and 55th
percentile equivalent to the difference between the 90th and 95th?
To answer this, we must notice the way in which test scores for a
group of individuals usually pile up. We saw one histogram of scores
in Chapter 5 (p. 86). This picture is fairly representative of the
way the scores fall in many cases. There is a piling up of scores
around the middle scores and a tailing off at either end. The ideal
model of this type of score distribution, which is called the normal
curve, was also considered in Chapter 5 (pp. 97-99) and is shown in
Fig. 7.2. The exact normal curve is an idealized mathematical model,
but many types of tests and measures distribute themselves in some-
thing that approximates a normal curve. You will notice the piling
up of most of the cases in the middle, the tailing off at both ends,
and the symmetrical pattern.

Fig. 7.2. Normal curve, showing selected percentile points.

In Fig. 7.2, four score points have been marked. These are, in 
order, the 50th, 55th, 90th, and 95th percentiles. Note that near 
the median the 5 per cent of cases (the 5 per cent lying between the 
50th and 55th percentile) fall in a tall narrow pile. Toward the tail 
of the distribution the 5 per cent of cases (the 5 per cent between 
the 90th and 95th percentile) make a relatively broad low bar. Five
per cent of the cases spread out over a considerably wider range of
scores in the second case than in the first. The same number of per-
centile points corresponds to about 3 times as many score points
when we are around the 90th to 95th percentile as when we are near
the median. The further out on the tail we go, the more extreme
the situation becomes.
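
The "about 3 times" figure can be checked against the idealized normal curve. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_at` is ours. It assumes scores are exactly normally distributed, which real test scores only approximate.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_at(percentile):
    """Score, in standard deviation units, that cuts off `percentile` per cent."""
    return std_normal.inv_cdf(percentile / 100)


middle_width = z_at(55) - z_at(50)  # score range spanned by the 50th to 55th
tail_width = z_at(95) - z_at(90)    # score range spanned by the 90th to 95th

ratio = tail_width / middle_width
print(round(middle_width, 3), round(tail_width, 3), round(ratio, 1))
```

Under this assumption the upper interval comes out a little under three times as wide as the middle one, in line with the statement above.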

Thus, percentile units are typically and systematically unequal. 
The difference between being first or second in a group of 100 is 
many times as great as the difference between being 50th and 51st.
Equal percentile differences do not represent equal differences in
amount. Any interpretation of percentile ranks must take into ac-
count the fact that our scale has been pulled out at both ends and
squeezed in the middle. Mary, who falls at the 50th percentile in
arithmetic and the 55th in reading, shows a trifling difference in
these two abilities, whereas Alice, with percentiles of 90 and 95,
shows a marked difference.

Percentile norms, to conclude, provide a basis for interpreting the
score of an individual in terms of his standing in some particular
group. If the percentile is to be meaningful, the group must be one
with which it is reasonable and appropriate to compare him. We
will usually need a number of tables of percentile norms based on
different groups, if we are to use a test with different ages, grades,
or occupations. As long as percentiles for appropriate groups are
supplied, this type of norm is widely applicable. But interpretation
of percentile values is made more difficult by the fact that we have
a systematically "rubber" scale whose units are small in the middle
range and large at the extremes.


STANDARD SCORES 


Because the units of a score system based on percentiles are so
clearly not equal, we are led to look for some other unit that does
have the same meaning throughout its whole range of values. Stand-
ard-score scales have been developed to serve this purpose.


In Chapter 5 we developed the standard deviation as a measure 
of the spread or scatter of a group of scores. The standard deviation 
was a type of average of the deviations of scores away from the 
mean—the root-mean-squared deviation. Scores may be expressed 
in standard deviations away from the mean. Thus, if the mean of a 
set of scores is 65 and the standard deviation is 15, a score of 80 is 
1 standard deviation above the mean. A score of 35 is 2 standard 
deviations below the mean. In standard deviation units, we could 
call them +1.0 and -2.0 respectively.

Suppose we have given two tests to a group. The means and 
standard deviations for the group are shown below, as are the scores 
made by Johnny and Mary. 


Test A Test B 
Mean 65 40 
Standard deviation 15 10 
Johnny's score 77 55 
Mary's score 87 48 


Let us see how we can use standard scores to compare performances
on the two tests or of the two individuals. 

On test A, Johnny is 12 points above the mean, or 12/15 = 0.8
standard deviations above the mean. On test B he is 15 points, or
15/10 = 1.5 standard deviations above the mean. Thus, Johnny
does a good deal better on test B than on test A. For Mary, the
corresponding calculations give

$$\text{Test A: } \frac{87 - 65}{15} = 1.5 \qquad \text{Test B: } \frac{48 - 40}{10} = 0.8$$


Thus, we may say that Mary did as well on test A as Johnny did on
test B, and vice versa. Each pupil's level of excellence is expressed
as so many standard deviation units above or below the mean of the
comparison group. This is a standard unit of measure having essen-
tially the same meaning from one test to another. For aid in inter-
preting the degree of excellence represented by a standard score, see
Table 5.7 (p. 99).
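
The four calculations for Johnny and Mary can be reproduced with a short Python sketch. The function name `z_score` and the dictionary layout are our own choices for illustration; the means, standard deviations, and raw scores are the ones given in the table above. Results are rounded to one decimal place, as in the text.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the group mean, in standard deviations."""
    return (raw - mean) / sd


# (mean, standard deviation) for each test, from the table above.
tests = {"A": (65, 15), "B": (40, 10)}
raw_scores = {"Johnny": {"A": 77, "B": 55}, "Mary": {"A": 87, "B": 48}}

standing = {
    pupil: {test: round(z_score(raw, *tests[test]), 1) for test, raw in by_test.items()}
    for pupil, by_test in raw_scores.items()
}
print(standing)
# {'Johnny': {'A': 0.8, 'B': 1.5}, 'Mary': {'A': 1.5, 'B': 0.8}}
```

Mary's unrounded value on test A is 22/15, or about 1.47, so her match with Johnny's 1.5 on test B holds only to one decimal place.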

The type of score in standard deviation units that we have just 
presented is satisfactory except for two matters of convenience:
(1) it requires us to use plus and minus signs which may be mis-
copied or overlooked, and (2) it gets us involved with decimal points
which may be misplaced. We can get rid of the need to use decimal
points by multiplying every standard deviation score by some con-
stant, such as 10. We can get rid of minus signs by adding to every
score a convenient constant amount such as 50. Thus, for Johnny's
scores on test A and test B, we have


                                             Test A   Test B
Mean of distribution of scores                 65       40
Standard deviation of distribution             15       10
Johnny's raw score                             77       55
Johnny's score in standard deviation units    +0.8     +1.5
Standard deviation score × 10                  +8      +15
Plus a constant amount (50)                    58       65


A table of standard scores for test A, based on this conversion, in
which the mean is set equal to 50 and the standard deviation to 10,
is shown in Table 7.5.


Table 7.5. Standard-Score Equivalents for Test A 


(Standard score mean = 50, S.D. = 10)


Raw     Standard    Raw     Standard    Raw     Standard
Score   Score       Score   Score       Score   Score
120     87          80      60          40      33
115     83          75      57          35      30
110     80          70      53          30      27
105     77          65      50          25      23
100     73          60      47          20      20
 95     70          55      43          15      17
 90     67          50      40          10      13
 85     63          45      37           5      10
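
The conversion behind Table 7.5 is a straight-line rescaling, sketched below in Python. The function name `standard_score` and its keyword arguments are ours. Rounding to the nearest whole number is an assumption, though it agrees with the entries of the table.

```python
def standard_score(raw, mean, sd, new_mean=50, new_sd=10):
    """Rescale a raw score so the group has mean `new_mean` and S.D. `new_sd`."""
    return round(new_mean + new_sd * (raw - mean) / sd)


# Johnny's two scores, using the means and standard deviations given earlier.
print(standard_score(77, mean=65, sd=15))  # test A: 58
print(standard_score(55, mean=40, sd=10))  # test B: 65

# Rebuild the test A column of Table 7.5 for raw scores 120, 115, ..., 5.
table_7_5 = {raw: standard_score(raw, 65, 15) for raw in range(120, 0, -5)}
print(table_7_5[120], table_7_5[105], table_7_5[5])  # 87 77 10
```

Changing `new_mean` and `new_sd` gives any other linear standard-score scale; for example, `new_mean=500, new_sd=100` places a raw score of 80 on test A at 600.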


We could have used values other than 50 and 10 in setting up our 
conversion into convenient standard scores. The Army has used a 
standard-score scale with mean of 100 and standard deviation of 20
for reporting its test results. The College Entrance Examination
Board has long used a scale with mean of 500 and standard deviation
of 100. The Navy has used the 50 and 10 system. The particular
choice of scale is arbitrary and a matter of convenience. What is
important is that the same scale and comparable norming groups be
used for all tests in the organization, so that results from different
tests may be directly comparable.

Some standard-score scales are developed via the percentiles corre-
sponding to the raw scores. The test maker assumes that the trait
he is measuring is basically distributed in accordance with the normal
curve. If he does not get a normal distribution of scores in his group,
he assumes that this is because the raw-score units in which his test
scores were expressed did not represent equal units throughout the
range of scores. You will remember our discussion of this point in
connection with our spelling test (pp. 154-155). He therefore takes
steps to make his distribution of standard scores normal; that is, he
normalizes it. The actual calculations make use of percentiles and of
tables of the normal curve. We shall not illustrate the details of
procedure here.

These standard scores have the distinctive feature that they are 
guaranteed to have a normal distribution, at least for a population 
comparable to that on which the original norms were obtained. The 
score scale has been stretched in some places and squeezed together 
in others so that finally a normal distribution results. This process 
of stretching and squeezing can take care of any inequality in the 
original units at different raw-score levels in the test. If the original 
assumption of a normal distribution was justified, this transforma- 
tion will produce a scale in which a point of score really represents 
the same amount at any point on the scale. These are normalized 
standard scores. The term T-score which the reader of testing litera- 
ture quite often encounters refers to this type of normalized standard 
score based on a single age group. 
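
For readers who want to see the general shape of the normalizing procedure, here is a minimal Python sketch. It is not the procedure of any particular test publisher. The function name `normalized_t_scores` is ours, and the midpoint convention for percentile ranks (counting half the cases at a score as falling below it) is an assumption. `statistics.NormalDist` stands in for the printed tables of the normal curve.

```python
from statistics import NormalDist


def normalized_t_scores(scores):
    """Map each distinct raw score to a normalized standard score (mean 50, S.D. 10).

    Step 1: find the proportion of the group below each score, counting
            half of the cases at that score (an assumed convention).
    Step 2: find the normal-curve deviate cutting off that proportion.
    Step 3: rescale the deviate to the 50-and-10 system.
    """
    n = len(scores)
    std_normal = NormalDist()
    result = {}
    for value in sorted(set(scores)):
        below = sum(1 for s in scores if s < value)
        equal = sum(1 for s in scores if s == value)
        proportion = (below + 0.5 * equal) / n
        z = std_normal.inv_cdf(proportion)
        result[value] = round(50 + 10 * z)
    return result


# Hypothetical, badly skewed raw scores: most pupils bunch at the low end.
raw = [1, 1, 1, 2, 2, 3, 4, 9, 20]
print(normalized_t_scores(raw))
```

The large raw-score gap between 9 and 20 shrinks after the conversion, because the new scale depends only on each score's standing in the group. This is the stretching and squeezing described above.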

In summary, standard scores, like percentiles, base the interpreta- 
tion of the individual's score on his performance in relation to a 
particular reference group. They differ from percentiles in that they 
are expressed in presumably equal units. The basic unit is the num- 
ber of standard deviation units above or below the mean of the group. 
Different numerical standard-score scales have been used by different 
testing agencies. 


INTERCHANGEABILITY OF DIFFERENT TYPES 
OF NORMS 


Whichever type of norm is used, a table of norms will be prepared 
by the test publisher. This will show the different possible raw 
scores on the test, together with the corresponding score equivalents 
in the system of norms being used. Many publishers will provide 
tables giving more than one type of score equivalent. An example 
is given in Table 7.6. Here we see the norms for the Arithmetic 
Problems Test of the revised Metropolitan Achievement Test Battery. 
All four types of norms are shown. The percentiles are based on a 
group tested early in the sixth grade. The standard-score scale
assigns a mean of 200 and a standard deviation of 20 to a mid-sixth-
grade group. Thus, a boy with a score of 20 can be characterized as

1. Having an age equivalent of 12 years, 9 months.
2. Having a grade equivalent of 7.3.
3. Falling at the 76th percentile in the sixth-grade group.
4. Receiving a standard score of 211.


From Table 7.6 it is easy to see that the different systems of 
norms are different ways of expressing the same thing. We can 
translate from one to the other, moving back and forth. Thus, a 
child who falls at the 80th percentile in the sixth-grade group has an 


Table 7.6. Norms for Metropolitan Achievement Test, Intermediate,
Form R, Arithmetic Problems


                                           6th Grade
                                           Percentile
Raw      Grade     Age       Standard      (Nov.
Score    Equiv.    Equiv.    Score         Testing)
 30      11.2      16-7+     248
 29      10.6      16-0      243
 28      10.0      15-5      238           98
 27       9.5      14-11     234           97
 26       9.0      14-6      230           95
 25       8.6      14-1      226           93
 24       8.2      13-9      222           90
 23       7.9      13-5      219           86
 22       7.7      13-2      216           83
 21       7.5      12-11     214           80
 20       7.3      12-9      211           76
 19       7.1      12-6      209           73
 18       6.9      12-4      206           68
 17       6.7      12-2      203           63
 16       6.5      12-0      201           58
 15       6.3      11-9      198           53
 14       6.2      11-7      195           48
 13       6.0      11-5      193           44
 12       5.8      11-3      190           39
 11       5.6      11-0      187           33
 10       5.5      10-10     184           27
  9       5.3      10-8      181           22
  8       5.1      10-6      178           17
  7       4.9      10-4      175           14
  6       4.7      10-2      172           11
  5       4.6       9-10     169            8
  4       4.3       9-6      165            6
  3       3.9       9-1      159            4
  2       3.4       8-6      153            2
  1       3.0-      8-0-     145


Copyright 1946 by the World Book Company. Reproduced by permission. 


age equivalent of 12-11 and a grade equivalent of 7.5. An age
equivalent of 12-11 corresponds to a standard score of 214. The
different systems of interpretation support one another for different
purposes.
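
Because every system of norms is tied to the same raw score, translating between them amounts to finding the right row of the table. The Python sketch below transcribes three rows of Table 7.6. The names `NormRow`, `NORMS`, and `row_for_percentile` are ours, introduced only for illustration.

```python
from collections import namedtuple

NormRow = namedtuple("NormRow", "grade_equiv age_equiv standard_score percentile")

# Three rows of Table 7.6, keyed by raw score (age equivalents as years-months).
NORMS = {
    21: NormRow(7.5, "12-11", 214, 80),
    20: NormRow(7.3, "12-9", 211, 76),
    18: NormRow(6.9, "12-4", 206, 68),
}


def row_for_percentile(percentile):
    """Find the raw score and the other equivalents for a given percentile."""
    for raw, row in NORMS.items():
        if row.percentile == percentile:
            return raw, row
    raise KeyError(percentile)


raw, row = row_for_percentile(80)
print(raw, row.age_equiv, row.grade_equiv, row.standard_score)
# 21 12-11 7.5 214
```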


QUOTIENTS 


After age norms had been used for mental tests for a few years,
the need was felt to convert the age score into an index that would
express rate of progress. The 8-year-old who had an age equivalent
of 10½ years was obviously better than average, but how much
better? Some index was needed to take account of chronological
age (actual time lived) as well as the age equivalent on the test
(score level reached).

The expedient was hit upon of dividing test age by chronological
age to yield a quotient. This procedure has been applied most
extensively with tests of intelligence where the age equivalent we are
concerned with is a mental age and the corresponding quotient is an
intelligence quotient. However, it has also been used to some extent
for achievement tests and for some other sorts of measures.

The formula for computing the intelligence quotient is given below
and is illustrated for the 8-year-old who reaches the 10½-year
level on the test.

\[
\text{I.Q.} = \frac{100\,\text{M.A.}}{\text{C.A.}} = \frac{100(10.5)}{8} \approx 131
\]
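The computation can be sketched in a few lines. The function name `quotient` is an assumption made for illustration; both ages are taken to be in years, and the same function serves for any test that yields an age equivalent.

```python
def quotient(test_age, chronological_age):
    """Return 100 times the ratio of test age to chronological age.

    Both ages must be in the same unit (years here). With a mental age
    this is an intelligence quotient; with a reading age, a reading
    quotient; and so on.
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * test_age / chronological_age


# The 8-year-old who reaches the 10 1/2-year level on the test:
print(round(quotient(10.5, 8)))

# A child whose test age equals his chronological age scores 100:
print(quotient(10, 10))
```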


A similar quotient could be computed for a reading test, general
achievement battery, measure of strength, or any other testing
instrument that yields age norms. The resulting value would be called
a reading quotient (R.Q.), educational quotient (E.Q.), or the like.

How does an intelligence quotient come to have meaning? In the
first place, it is obvious by the way in which the quotient was
established that 100 should be average at every age group, since the
average 10-year-old, for example, should fall exactly at the 10-year level
on our test if the age equivalents were properly established. But
how outstandingly good is 125? How poor is 80? Such questions
as these can only be answered by becoming acquainted with the
distribution of quotients that a particular test yields.

The intelligence quotient was originally developed in connection
with the individual intelligence test of the type represented by the
Stanford-Binet (see Chapter 9). A typical distribution of intelligence
quotients for the latest revision of that test, based upon the
standardization group, is shown in Table 7.7. This table shows the per
cent of cases falling within each 10-point I.Q. interval and the cumulative
percentage through each interval. Thus, 1.3 per cent of cases got
I.Q.'s of 140 and over, 3.1 per cent from 130 to 139, and so forth.
An I.Q. of 125 would surpass roughly 91 per cent of the group (fall
at the 91st percentile), whereas one of 80 would surpass only about
8 per cent. The mean for this particular distribution of I.Q.'s is
101.5, and the standard deviation is 16.3.


Table 7.7. Distribution of Revised Stanford-Binet I.Q.'s

                  Per Cent     Cumulative
 I.Q. Range       of Cases     Per Cent
 140 and over       1.3          99.9
 130-139            3.1          98.6
 120-129            8.2          95.5
 110-119           18.1          87.3
 100-109           23.5          69.2
  90- 99           23.0          45.7
  80- 89           14.5          22.7
  70- 79            5.6           8.2
  60- 69            2.0           2.6
 Below 60           0.6           0.6

From L. M. Terman and M. A. Merrill, Measuring intelligence, Boston,
Houghton Mifflin Co., 1937.
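The percentile figures quoted for I.Q.'s of 125 and 80 can be checked from the per cent column of Table 7.7. The sketch below is a rough illustration under stated assumptions: cases are taken to be spread evenly within each 10-point interval, interval limits are taken at the printed lower bounds, and the open "Below 60" interval is given a nominal lower bound of 50. The names `INTERVALS` and `percent_below` are hypothetical.

```python
# (lower bound of interval, per cent of cases), lowest interval first.
# The "Below 60" interval is entered with a nominal lower bound of 50,
# which is an assumption made only so that the arithmetic is uniform.
INTERVALS = [
    (50, 0.6), (60, 2.0), (70, 5.6), (80, 14.5), (90, 23.0),
    (100, 23.5), (110, 18.1), (120, 8.2), (130, 3.1), (140, 1.3),
]
WIDTH = 10


def percent_below(iq):
    """Estimate the per cent of the group falling below `iq`.

    Whole intervals below `iq` are counted in full; the interval that
    contains `iq` is counted in proportion (linear interpolation).
    """
    total = 0.0
    for lower, pct in INTERVALS:
        if iq >= lower + WIDTH:
            total += pct
        elif iq > lower:
            total += pct * (iq - lower) / WIDTH
    return total


print(round(percent_below(125), 1))  # roughly 91 per cent, as in the text
print(round(percent_below(80), 1))   # about 8 per cent, as in the text
```

The interpolation is crude at the extremes, where the intervals are open-ended, but it is adequate for the middle of the distribution.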
The circumstance that makes intelligence quotients from such a
test as the Stanford-Binet relatively interpretable is that the mean
and standard deviation remain relatively uniform from age to age.
Thus, an I.Q. of 125 signifies about the same status, relative to his
own age group, whether obtained for a 5-year-old or a 15-year-old.
This situation is not necessarily true and is not perfectly true even
for this test, but in many instances quotients maintain the same
average and spread of values in different age groups sufficiently
closely so that a common interpretation is appropriate at all age
levels.

To all intents and purposes, quotients represent a type of standard
score. In the case of the Revised Stanford-Binet, we have a standard
score with a mean of approximately 100 and standard deviation of
approximately 16 in a general sample of American children. This
relationship of quotients to standard scores is explicitly recognized
in some recent intelligence tests. For these, tables of I.Q. equivalents
have been set up at each age level. These have been built so
as to give a common mean and standard deviation for all age groups.
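A table of I.Q. equivalents of this kind can be sketched as an ordinary standard-score conversion carried out within one age group. The raw scores below are invented for illustration, and the function name `deviation_quotients` is hypothetical; the target mean of 100 and standard deviation of 16 are the approximate Stanford-Binet values mentioned above.

```python
from statistics import mean, pstdev


def deviation_quotients(raw_scores, target_mean=100.0, target_sd=16.0):
    """Convert the raw scores of ONE age group to quotient-like standard scores.

    Each score is expressed as a deviation from the age-group mean in
    standard-deviation units, then rescaled to the target mean and SD.
    """
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    if s == 0:
        raise ValueError("scores must vary within the age group")
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]


# Hypothetical raw scores for two age groups on the same test.
ten_year_olds = [31, 38, 42, 45, 54]
twelve_year_olds = [44, 52, 57, 61, 71]

for label, scores in (("10-year-olds", ten_year_olds),
                      ("12-year-olds", twelve_year_olds)):
    converted = deviation_quotients(scores)
    print(label, [round(q) for q in converted])
```

Because the conversion is made separately at each age, every age group ends with the same mean and standard deviation, which is what makes a single interpretation of a given quotient possible.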


The quotients yielded by different tests are, unfortunately, not
exactly equivalent. A variety of factors in the test and in the
selection of norming groups have led to somewhat different means and
standard deviations of intelligence quotients. Some evidence on the
variability of quotients for three widely used tests for high-school
groups is presented in Table 7.8. Experience with a test in a particular
community setting will provide a further basis for interpreting
quotients at different levels.


Table 7.8. Equivalent I.Q.'s on (1) Terman-McNemar Test of Mental
Ability, (2) Otis Quick-Scoring Mental Ability Tests: Gamma Test, and
(3) Pintner General Ability Tests: Verbal Series, Advanced Test

(From Lennon 2)

 Terman I.Q.    Otis I.Q.     Pintner I.Q.
 140            132-133       136-137
 130            123           126
 120            116           118
 110            107           108
 100            96-97
  90            85
  80            74
  70            67

[In the rows for Terman I.Q.'s of 100 through 70 only one of the two
equivalents is legible in this copy, and the column to which it
belongs is uncertain.]

The notion of the intelligence quotient is deeply imbedded in the 
testing movement. Quotients seem to have little advantage over 
standard scores and percentiles, to which they are intimately related. 
It is only as a specific type of standard score that a quotient has any 
precision of meaning. This precision is considerably blurred by the 
variation from test to test. Eventually we may hope that quotients 
will be completely replaced by standard scores and that a uniform 
and universal standard-score scale will be accepted, in terms of which 
all test results will be reported. What is needed is for all producers 
of tests for widespread distribution to agree upon a common score 
scale. 


PROFILES 


The various types of norms we have been considering provide a
means of expressing scores on quite different tests in common units
in such a way that they can be directly compared. There is no
direct way of comparing a score of 30 words correctly spelled with
one of 20 arithmetic problems solved. But if both scores are
expressed in terms of the grade level to which they correspond or in
terms of the per cent of some defined common group that gets scores
below that point, then they may be compared. The set of different
test scores for an individual, expressed in a common unit of measure,
constitutes his score profile. The separate scores may be presented
for comparison in tabular form by listing the converted score values.
Illustrations of record forms showing the manner of recording
converted scores are given in Figs. 7.3 and 7.4. The comparison of
different subareas of performance is made pictorially clearer by a graphic
profile. Several ways of plotting profiles are shown in Figs. 7.5, 7.6,
and 7.7.

[Fig. 7.3. Class record form for Metropolitan Achievement Tests. (Scores recorded as grade equivalents.)]

[Fig. 7.4. Form for recording intelligence tests. (Reproduced by permission of California Test Bureau.)]

Figure 7.5 shows the form for plotting the subscores of the California
Test of Mental Ability. Each subtest is represented by a row.
The scale of age equivalents appears across the top of the form. The
broken vertical line portrays the performance of the particular
individual. Peaks in performance are to the right and low points to
the left.

[Fig. 7.5. Profile chart for an individual pupil. (Reproduced by permission of California Test Bureau.)]

Figure 7.6 shows a similar form for plotting part scores on the
Metropolitan Achievement Test. This form differs in representing the
different tests in successive columns and presenting the score scale
in the vertical dimension. Grade equivalents are shown on the
vertical scale.

Figure 7.7 shows a type of profile chart for the component tests of
the Differential Aptitude Test Battery. This battery undertakes to
appraise different aspects of ability important in a high-school
guidance program. Note that in this case the different tests are
represented by separate bars, rather than points connected by a line.
The scale used in this case is a percentile scale, but in plotting
percentile values appropriate adjustments have been made for the
inequality of percentile units. The percentile values are more widely
spaced at the upper and lower extremes, in the same way that the
percentile points are spaced out in the normal curve that was plotted
in Fig. 7.2. By this process, the percentile values for an individual
are plotted on an equal-unit scale. A given linear distance can
reasonably be thought of as representing the same amount of ability
whether it lies high, low, or near the middle of the scale. By the
same token, the same distance can be considered equivalent from one
test to another.

Note that in Fig. 7.7 the bars have been plotted up and down
from the 50th percentile. For this type of norm, the average of the
group constitutes the anchor point of the scale, and individual scores
can be referred to this base level. This type of figure brings out the
individual's strengths and weaknesses very dramatically.
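The adjustment for the inequality of percentile units can be sketched as a conversion of each percentile to its position on the base line of the normal curve. This is a minimal illustration that assumes a normal distribution of the trait, as in the normal-curve plot referred to above; the function name `percentile_to_position` is hypothetical.

```python
from statistics import NormalDist


def percentile_to_position(percentile):
    """Return the normal-curve position (in SD units from the mean) of a percentile.

    The 50th percentile falls at 0; positions above the average are
    positive and positions below it are negative.
    """
    if not 0 < percentile < 100:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(percentile / 100.0)


for p in (1, 10, 25, 50, 75, 90, 99):
    print(f"{p:>3}th percentile -> {percentile_to_position(p):+.2f} SD")
```

On such a scale the step from the 90th to the 99th percentile covers more distance than the step from the 50th to the 60th, which is why the percentile marks crowd together near the middle of the chart and spread apart at the extremes.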

The profile chart makes a very effective way of representing the
scores for an individual. In interpreting profiles, however, several
cautions must be borne in mind. In the first place, procedures for
plotting profiles assume that the norms for the several tests are
comparable. Age, grade, or percentile scores must be based upon
equivalent groups for all the tests. The best guarantee of equivalence
is, of course, a common population used for all tests. This is the
situation that commonly prevails for the different subtests of a test
battery. Norms for all are established at the same time on the basis of
testing a common group. The guarantee of comparability of the norms for
the different component tests is one of the most [illegible] features
of the integrated battery. If separately developed tests are plotted
together, one can usually only hope that the groups on which the norms
were established were comparable and that the profile is a [illegible]
picture of relative achievement in the different fields. One solution
of such a problem would be to develop one's own local norms on a
common population and to plot individual profiles in terms of the
local norms.

[Fig. 7.6. Score profile for Metropolitan Achievement Test Battery and Otis Intelligence Test. Vertical scale: grade equivalent. Columns: Reading, Vocabulary, Arith. fund., Arith. reasoning, Language, Spelling, Otis intelligence, Grade level.]

A second problem is that of interpreting the ups and downs of a
profile. The profile focuses our attention upon differences within the
individual; we are not now concerned with differences between
individuals. In Fig. 7.5 we observe that the pupil does better on the
non-language factors than he does on the language factors. How
much confidence may we have in that difference? How sure can we
be that we would get a difference in the same direction if we got the
same two subscores from another form of the test?


[Fig. 7.7. Pupil profile chart for Differential Aptitude Tests (G. K. Bennett, H. G. Seashore, and A. G. Wesman). Bars for Verbal, Numerical, Abstract, Space, Mechanical, Clerical, Spelling, and Sentences, plotted against standard-score and percentile scales. (Reproduced by permission of the Psychological Corporation.)]


The problem now facing us is an aspect of the problem of reliability, 
which we considered in some detail in the previous chapter. How 
reliable is our evaluation of the difference between the individual's 
score on two tests? It is, unfortunately, true that the appraisal of 
the difference between two tests usually has substantially lower re- 
liability than the reliability of the two tests taken separately. This 
is due to two factors: (1) the errors of measurement in both separate 
tests affect the difference score, and (2) whatever is common to both 
measures is canceled out in the difference score. We can illustrate 
the situation by a diagram. Look at Fig. 7.8. 


Each bar in Fig. 7.8 represents performance * on a test, broken up
into a number of parts to represent the factors producing this
performance. The first bar represents an intelligence test, and the
second a reading test. Notice that we have divided reading performance
into three parts. One part, labeled "common factors," is a complex
of general intellectual abilities that operate both in the reading and
the scholastic aptitude test. A second part, labeled "specific reading
factors," consists of abilities that appear only in the reading test.
The third part, labeled "error," is chance error of measurement. Three
similar parts are indicated for the intelligence test. Now look at the
third bar, which represents the difference score. In this bar, the common
factor has disappeared. It canceled out in our process of subtraction.
Only the specific factors and the errors of measurement remain.
These are the factors that determine the difference score.
And the errors of measurement bulk relatively much larger in this
third bar. In the limit, where two tests measured exactly the same
common factors, only the errors of measurement would remain in
the difference scores, and the differences would have exactly zero
reliability.

* More precisely, variance in performance.

[Fig. 7.8. Nature of a difference score. Three bars: Intelligence test (error, specific intelligence factors, common factors); Reading test (common factors, specific reading factors, error); Difference score (specific factors and errors only).]

The reliability of the difference between two components of a
profile can be expressed in a fairly simple formula, which reads

\[
r_{\text{diff}} = \frac{\dfrac{r_{11} + r_{22}}{2} - r_{12}}{1 - r_{12}}
\]

where $r_{11}$ is the reliability of one measure,
$r_{22}$ is the reliability of the other measure, and
$r_{12}$ is the correlation between the two measures.

Thus, if the reliability of test A is 0.80, the reliability of test B is
0.90, and the correlation of A and B is 0.60, for the reliability of the
difference score we have

\[
r_{\text{diff}} = \frac{\dfrac{0.80 + 0.90}{2} - 0.60}{1 - 0.60} = \frac{0.25}{0.40} = 0.62.
\]
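The formula is easily put to work on other combinations of values. The following is a minimal sketch; the function name `difference_reliability` is an assumption made for illustration.

```python
def difference_reliability(r11, r22, r12):
    """Reliability of the difference between two scores.

    r11, r22: reliabilities of the two measures.
    r12: correlation between the two measures (must be below 1).
    """
    if r12 >= 1:
        raise ValueError("the correlation between the tests must be below 1")
    return ((r11 + r22) / 2 - r12) / (1 - r12)


# The worked example: reliabilities 0.80 and 0.90, intercorrelation 0.60.
# The exact quotient is 0.25/0.40 = 0.625, reported in the text as 0.62.
print(round(difference_reliability(0.80, 0.90, 0.60), 3))

# When the tests correlate as highly as their average reliability,
# nothing dependable is left in the difference.
print(round(difference_reliability(0.80, 0.80, 0.80), 3))
```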
In Table 7.9 the value of $r_{\text{diff}}$ is shown for various
combinations of values of $(r_{11} + r_{22})/2$ and $r_{12}$. Thus, if
the average of the reliabilities of our two tests,
$(r_{11} + r_{22})/2$, is 0.80, the reliability of the difference
score is 0.80 when the two tests have zero intercorrelation, is 0.60
when the intercorrelation is 0.50, and is 0.00 when the
intercorrelation is 0.80. It is clear that, as soon as the correlation
between the two tests begins to approach the average of their separate
reliability coefficients, the reliability of the difference score drops
very rapidly.


Table 7.9. Reliability of a Difference Score

 Correlation
 between             Average of Reliability of Two Tests
 Two Tests                  $(r_{11} + r_{22})/2$
 ($r_{12}$)      .50    .60    .70    .80    .90    .95
 .00             .50    .60    .70    .80    .90    .95
 .40             .17    .33    .50    .67    .83    .92
 .50             .00    .20    .40    .60    .80    .90
 .60                    .00    .25    .50    .75    .88
 .70                           .00    .33    .67    .83
 .80                                  .00    .50    .75
 .90                                         .00    .50
 .95                                                .00
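The entries of Table 7.9 follow directly from the formula for the reliability of a difference, and the whole grid can be regenerated as a check. The sketch below works in whole hundredths with exact fractions so that halves round upward as they do in the printed table; the names `cell` and `table` are hypothetical, and cells are left blank wherever the intercorrelation exceeds the average reliability.

```python
from fractions import Fraction
from math import floor

AVERAGES = (50, 60, 70, 80, 90, 95)            # average reliability, in hundredths
CORRELATIONS = (0, 40, 50, 60, 70, 80, 90, 95)  # intercorrelation, in hundredths


def cell(avg_pct, r12_pct):
    """Reliability of a difference score, rounded half-up to two decimals.

    Returns None where the intercorrelation exceeds the average
    reliability (the blank cells of the table).
    """
    if r12_pct > avg_pct:
        return None
    value = Fraction(avg_pct - r12_pct, 100 - r12_pct)
    return floor(value * 100 + Fraction(1, 2)) / 100


def table():
    """Map each intercorrelation to its row of table entries."""
    return {r12: [cell(avg, r12) for avg in AVERAGES] for r12 in CORRELATIONS}


for r12, row in table().items():
    cells = ["   " if v is None else f"{v:.2f}"[1:] for v in row]
    print(f".{r12:02d}   " + "   ".join(cells))
```

Reading down any column shows how quickly the dependable part of a difference shrinks as the two tests come to measure the same thing.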

The low reliability of difference scores is not in any unique way
related to profiles. However, it comes to the fore as we consider
profiles, because there our interest focuses on differences in level of
performance within the same individual. The sharply pictorial
nature of a profile tends to bring differences to our attention, and we
are enticed into interpreting them. We must beware lest we
overinterpret these differences. Remember that the individual's true
score may fall anywhere within a range on either side of the score
that we obtained for him. It would be a healthy corrective if on
every profile we plotted not the obtained score but rather a bar
extending one standard error of measurement (or even possibly two)
on either side of the obtained score. We would then be less likely to
start remedial instruction or to counsel with respect to a vocational
plan on the basis of differences that would be likely to evaporate if
the testing were repeated with a comparable set of tests.

We have modified Fig. 7.6 by adding bars of the type just
proposed. This is shown in Fig. 7.9. The broad bar extends for 1
standard error of measurement on each side of the obtained score,
and the thin line extends for 2. Note how these change the effect
of the profile. We are impressed now not only by differences but
also by the overlap in the bars. Differences between reading and
vocabulary, between arithmetic reasoning and language shrink into
relative insignificance. Only the spelling test appears to be
definitely below the others. The tentative nature of the differences is
clearly brought out. Actually drawing these bars is rather a chore,
but the user of profiles should try to keep them in his mind's eye
and think of a score as representing a band rather than a point.

[Fig. 7.9. Score profile for Metropolitan Achievement Test Battery and Otis Intelligence Test. Vertical scale: grade equivalent, 2.2 to 4.6. Columns: Reading, Vocabulary, Arith. fund., Arith. reasoning, Language, Spelling, Otis intelligence, Grade level.]

Organizing the separate test scores of an individual into a graphic
profile is, then, a very effective way of dramatizing the high and
low points in a score pattern. Such a profile may be plotted
whenever scores from several different tests are expressed in the same units.
However, a profile must be interpreted with a good deal of caution,
because even unreliable differences may look quite impressive.
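The habit of thinking of a score as a band can be sketched in a few lines. The sketch uses the usual expression for the standard error of measurement, the standard deviation times the square root of one minus the reliability. The grade-equivalent scores and the standard error of 0.3 below are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement from a test's SD and reliability."""
    return sd * sqrt(1.0 - reliability)


def band(score, sem, width=1):
    """Return the (low, high) band extending `width` standard errors each way."""
    return (score - width * sem, score + width * sem)


def bands_overlap(a, b):
    """True when two (low, high) bands share any common ground."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical grade-equivalent scores, each with an assumed SEM of 0.3.
sem = 0.3
reading = band(3.8, sem)
vocabulary = band(3.5, sem)
spelling = band(2.6, sem)

print(bands_overlap(reading, vocabulary))  # bands overlap: difference is tentative
print(bands_overlap(reading, spelling))    # bands are clear of each other
```

Passing `width=2` gives the wider band corresponding to the thin line of the modified profile.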


USING NORMS 


We have seen that norms provide a basis for interpreting the scores
of an individual. Converting the score for any test taken singly into
age or grade equivalent, percentile or standard score, permits an
interpretation of the level at which the individual is functioning on
that particular test. Bringing together the set of scores for an
individual in a common unit of measure, and perhaps exhibiting these
scores in a profile, provides a descriptive picture of the relative level
of performance of the individual in different areas.

The median performance for a class, a grade group in a school, or
the children in a grade throughout a school system may be similarly
reported. We then see the average level of performance within the
group on the single function or the relative performance of the group
in each of several areas. Norms provide a frame within which the
picture may be viewed and bring all parts of the picture into the
common frame. Now what does the picture mean, and what should
we do about it?

Obviously we cannot, in a few pages, provide a ready-made
interpretation for any set of scores that may be obtained in a practical
testing situation. However, we can lay out a few general guiding
lines and principles that may help to forestall some unwise
interpretations of test results. The first points are phrased with an eye to
the interpretation of group results. These are followed by some
points relating primarily to interpretation of individual scores.
However, the points overlap somewhat, and each has some reference to
the other type of situation.


PRINCIPLES GUIDING INTERPRETATION OF GROUP PERFORMANCE 

1. In Evaluating Average Group Achievement, Consideration Must
Be Given To Average Ability Level in the Group. A sixth-grade class
with an average mental age of 10 years could not be expected to do
arithmetic as well as one with an average mental age of 12 years.
Some adjustment must be made for the typical ability level.
However, one must be somewhat conservative in making such
adjustments, especially for classes superior on an intelligence test. The
correspondence between intelligence and academic achievement is
not perfect, and a group of bright youngsters will rarely be
comparably outstanding in achievement. This will be true particularly
in the more specialized and less academic subjects, such as spelling
or handwriting. A group that deviates from average in ability can
be expected to differ from the general norm in achievement also,
and in the same direction, but it should not be expected to differ as
much in achievement as it does in ability.

2. A Further Factor That May Be Expected To Influence Achieve- 
ment Is the Type of Cultural Background from Which the Children 
Come. Home and community influences are strong. Foreign-lan- 
guage background, absence of pictures and books in the home, a 
negative family attitude toward schools and schooling may all be 
important. In a measure, these factors affect intelligence test score. 
But they affect achievement also, and perhaps more directly. Where 
a class is atypical in cultural background, either especially favored 
or especially deprived, allowance must be made for this in interpret- 
ing test results. 

3. Group Performance Can Only Be Evaluated in the Light of
Curricular Content, Emphases, and Objectives. If a school system has
delayed all formal instruction in arithmetic until the third grade in
order to provide more time in the earlier grades for group projects,
social experiences, and preparatory materials, it is unreasonable to
expect the children in the third grade of that system to come up to
national third-grade norms in arithmetic. If a school system has
de-emphasized accurate spelling as an objective, has cut down or
eliminated spelling drills, and has concentrated on other educational
outcomes, it is inappropriate to evaluate that school by rigid
application of national norms in a standardized spelling test. There
is a good deal of evidence from test results themselves that schools
in the more prosperous and privileged communities have de-emphasized
the basic tool skills of arithmetic and spelling. These communities
often do no better in computation and spelling than much poorer
communities with children of lower intelligence.

Of course, the communities giving less emphasis to arithmetic and
spelling in order to achieve other less tangible educational outcomes
may not actually be achieving them. Whether they are can only be
answered as we develop measures for such objectives as ability to
follow directions, to work alone, to take care of property, to get
along with other children, or to grow in social relationships, which
are objectives given emphasis in the stated objectives of these
communities. Instruments for appraising these objectives should receive
the attention of the measurement specialist and the schools
themselves. But one thing is clear. The school's objectives and
curricular emphases must be taken into account in interpreting standardized
test results.

4. Use of Test Results Should Be Constructive, Not Punitive. One 
continually encounters situations in which results on achievement 
tests are used as a basis for evaluating the professional worth of 
teachers. The test then becomes a sword held over the teacher's
head, a recurring threat to his security. In such a situation, it should
be no wonder if the test is resented, if the teacher teaches in order
to "beat the test" or even gives illicit help at the time of testing.
The teacher is now on the side of the pupils working against the test. 

This type of situation is to be avoided at all costs. The threat 
arises in large measure out of administrative personnel and will dis- 
appear if they see in the tests primarily a tool to help both pupil and 
teacher. This will be facilitated if tests are given in the fall, when 
they can be used to guide the work of the year to come, rather than 
in the spring, to judge the work of the past year. 


PRINCIPLES GUIDING INTERPRETATION OF INDIVIDUAL PERFORMANCE 

1. Here Again, Achievement Must Be Evaluated in the Light of Evi- 
dences of Aptitude. The 12-year-old who is reading at the 9-year 
level is not a reading problem if his mental age is also 9. He is then 
doing just what could be expected of him. Too many remedial 
classes are filled with children who are really performing at or even 
above the level that should be expected for them. 

Again, the intellectually superior child cannot generally be
expected to be as superior in achievement as he is on the measure of
intelligence. In the first place, achievement depends upon exposure.
Even the bright fourth grader cannot be expected to do sixth-grade
arithmetic if he has never encountered or been taught the processes.
In some subjects, at least, opportunity sets very real limits to the
level which a person can reach. In the second place, abilities are to
a degree specialized. The child who is picked out because he is
bright is likely to be somewhat less outstanding in other specialized
educational skills.

2. For Individuals as Well as Groups, We Must Take Account of 
Family and Cultural Differentials. The wide range of variation in 
language background, richness of home resources, and incentives to 
progress in school may be expected to have a great impact on
educational skills and accomplishments, and allowance for I.Q.
differences will only in part take account of these factors.

3. The Individual Child's Performance, Too, Must Be Judged in
Terms of the Curriculum To Which He Has Been Exposed. The in- 
dividual pupils cannot be expected to progress as rapidly in those 
areas in which teaching emphasis is less. Furthermore, in those 
skills that are closely dependent upon instruction, even the able 
pupils cannot be expected to move ahead at a tempo much faster 
than that at which the material is presented. Thus, the bright child 
may be expected to be more advanced in word knowledge and read- 
ing skills, which he can readily pick up on his own, than in the 
processes of arithmetic, which he is unlikely to master until he has 
been exposed to them in the school setting. 

4. In the Case of the Single Individual, We Must Be Acutely Aware
of the Existence of Errors of Measurement. A test score does not
identify the exact level of ability for the child. It represents the
most likely value within a fairly broad band of possible values.
Differences between areas of achievement must be viewed as tentative
as long as these bands overlap. Differences between standing on
two testings (say, two reading tests a few months apart) should not
excite us unduly unless they are quite substantial. We should be
rather conservative in "explaining" differences that may represent
nothing more than the fallibility of our measuring instrument.

5. In the Very Nature of Things, by the Way Test Norms Are
Developed We Must in General Expect Half of Our Group to Fall Below
the Norm. The norm is the average, the typical. It is neither the
ideal of satisfactory accomplishment nor the standard to which we
can hold everybody. It is the typical performance of typical
individuals at the present time. In any average there must be as many
below as above. Educators must avoid the compulsion to bring
everybody "up to the norm." We must be careful not to try to fit
everybody into the Procrustean bed of the average.




SUMMARY STATEMENT 


A raw score, taken by itself, has no meaning. It gets meaning 
only by comparison with some reference group or groups. The com- 
parison may be with: 


1. A series of age groups (age norms). 

2. A series of grade groups (grade norms).

3. A single group, indicating what per cent of that group the score 
surpassed (percentile norms). 

4. A single group, indicating standard deviations from the group 
mean (standard scores). 


Each alternative has certain advantages and certain limitations,
which we have considered. 

To get an index of brightness from age norms, quotients such as 
the intelligence quotient and educational quotient were devised.
These become meaningful and usable when they have approxi- 
mately the same standard deviation for all age groups. In that 
case, they are essentially standard scores and should be thought of 
as such. 

If the norms available for a number of different tests are of the
same kind and are based on comparable groups, all the tests can be
expressed in comparable terms. They can then be shown pictorially
in the form of a profile. Profiles emphasize score differences within
the individual. When profiles are used, care must be taken not to
overinterpret minor ups and downs of the profile. 

Norms represent a descriptive framework for interpreting the score
of an individual, a class group, or some large aggregation. However,
before a judgment can be made as to whether an individual or group
is doing well or poorly, allowance must be made for ability level,
cultural background, and curricular emphases. The norm is merely
an average, not a strait jacket into which all can be forced to fit.


REFERENCES 


1. Boynton, Bernice, The physical growth of girls, Univ. Ia. Stud. Child
Welf., 1936, 12, No. 4.

2. Lennon, R. T., A comparison of results of three intelligence tests, Test
Service Notebook No. 11, Yonkers, N. Y., World Book, 1952.




SUGGESTED ADDITIONAL READING 


Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chap- 
ter 4. 

Flanagan, J. C., Units, scores, and norms, Chapter 17 in E. F. Lindquist, 
Editor, Educational measurement, Washington, D. C., American Council on 
Education, 1951. 

Monroe, Walter S., Editor, Encyclopedia of educational research, rev. ed.,
New York, Macmillan, 1950, pp. 795-802. 

Mosier, Charles I., Batteries and profiles, Chapter 18 in E. F. Lindquist, 
Editor, Educational measurement, Washington, D. C., American Council on 
Education, 1951. 


QUESTIONS FOR DISCUSSION 


1. A pupil in the seventh grade received a raw score of 13 on the Metropoli- 
tan Reading Test, Intermediate Level. What additional information would be 
needed to interpret this score? 

2. Why do standardized tests designed for use with high-school students
almost never use age or grade norms?

3. What limitations would national norms have for use by a county school
system in rural West Virginia? What might the local school system do about
it?

4. What assumption lies back of the development of age norms? Grade
norms? Normalized standard scores?

5. In Fig. 7.7, p. 178, why are the standard scores evenly spaced whereas
the percentile scores are unevenly spaced?

6. Using Tables 7.3 and 7.6, briefly characterize the following entering
sixth-grade children:

                            Reading   Arith.
           C.A.     M.A.     Score     Score
Pupil A    12.4     10.6       33         9
Pupil B    10.5     13.2       48        20
Pupil C    11.3     11.1       31        14


7. You are a guidance counselor and have given the Differential Aptitude
Battery to a ninth grade. Using Table 7.4, prepare a summary report and in- 
terpretation for a boy with the following scores: 


Verbal Reasoning 18 Mechanical Reasoning 54 
Numerical Ability 23 Clerical Speed & Acc. 45 
Abstract Reasoning 31 Spelling 14 
Spatial Relations 72 Sentences 22 

8. School A gives a battery of achievement tests each May in each grade 
from the third through the sixth. The median grade level in each subject in 
each teacher's class is reported to the superintendent. Should they be re- 
ported? If so, what else should be included in the report? In what ways 
might a superintendent use the results to advantage? What uses should he
avoid?




9. Miss B prides herself that each year she has gotten at least 90 per cent
of her fifth-grade group "up to the norm" in each subject. How desirable is
this as an educational objective? What limitations or dangers do you see
in it?

10. School C operates on a policy of assigning transfer students to a grade
on the basis of their grade standing on an achievement battery. Thus, a boy
with a grade score of 6.4 on the battery as a whole would be assigned to the
sixth grade, no matter what his age or his grade in his previous school. What
values do you see in this plan? What limitations?

11. The superintendent of schools in city D noted that school E fell
consistently about a half grade below national norms on an achievement battery.
He was distressed because this was the lowest of any school in his city. How
justified is his dissatisfaction? What more do you need to know to answer
this?

12. The board of education in city F noted that the second and third grades
in their community fell substantially below national norms in arithmetic,
though coming up to the norms in other subjects. They propose to study
this further. What additional information do they need?

13. A teacher wished to study gains in reading ability in his class. He gave
one reading test in October and a parallel form in February. The reliability
of the reading test is reported to be 0.90. He found a correlation of 0.85
between the October and February testings. What does this suggest concerning
the reliability of his measure of gain in reading ability? (See p. 180.)
What interpretation can be safely made of his test results?

14. In what ways is a Stanford-Binet I.Q. the same as a standard score?
In what ways is it different?

15. Examine Fig. 7.5. What are the possible advantages of a profile such
as this? What are its limitations and shortcomings? Is it desirable to plot
it and use the results?


Chapter 8


Where to Find Information 
about Specific Tests 


THE NATURE OF THE PROBLEM 


The production of educational and psychological tests has been
going on for only half a century, but during that time literally
thousands of different tests have been produced. In a comprehensive
bibliography which covered up to about 1945, Hildreth included
entries for 5294 different tests. The number could probably now be
increased by at least another thousand. Many published tests are
obsolete now, or only of historical interest, but the number of
currently available tests is still very great.

Not only is the total number of tests great. So also is the variety.
Tests vary widely in testing procedures, in content, and in group for
which designed. There are paper-and-pencil tests, individual
performance tests, rating scales, self-rating procedures, observational
procedures, and projective techniques. There are measures of attitude,
of interest, of temperament, of personal adjustment, of intellect,
of special aptitudes, and of all aspects of school achievement.
There are tests designed for infants, for preschool children, for school
children and adolescents, and for adults.

No one book can hope to introduce a student to even a representative
sampling of tests of all types, covering all sorts of content for all
age levels. The following chapters will introduce some of the most
important and most widely known tests, discussing them as examples
of many others. But this book cannot give a complete treatment
of any particular age group or subject area, and there are so
many special situations in which a reader may be interested or for
which he may need a test that the tests discussed here may include
not even one that fits his particular need.

Since it is impossible to list and evaluate all or even most of the
tests that might be of concern to an audience with varied interests,
we shall approach the problem at a different level. We shall try to
guide the reader to sources in which he can find the available tests
listed, and in some cases evaluated, and we will try to guide the
reader in evaluating the tests he locates. The present chapter discusses
resource materials for finding tests and for finding out about
them. Chapter 6 has given an orientation in the factors to be considered
in evaluating the suitability of a particular test for a particular
purpose.

The knowledge of where to go to find out about tests of a particular
type and how to evaluate one when found is probably more important
than predigested information about a particular test. Tests
change and the purposes of the test user change. It is impossible 
to anticipate what type test will be required for some future need. 
The important thing is to know how to go about finding the tests 
available for that need when it arises and how to evaluate their 
relative merits. 

There are several different types of questions about a test or an 
area of measurement for which one may seek answers. Some of the 
types of questions are: 


1. What tests have been developed that might serve my present 
need or purpose? 

2. What are the new tests in my field of interest? 

3. What is test A, of which I have heard, like? For what groups 
and purposes was it designed? Who made it? How long does it 
take and how much does it cost? What skills are needed to give and 
use it? 

4. What do specialists in the field of measurement have to say 
about test A? How do they evaluate it, in comparison with com- 
peting techniques? 

5. What basic factual material do we have on test A? What are
its statistical attributes? What are its relationships to other measures?

6. What research has been done studying or using test A?


Let us see what materials are available to us as we try to answer
questions such as these. These resources include (1) textbooks in
special areas, (2) comprehensive bibliographies, (3) the Mental
Measurements Yearbooks, (4) test reviews in professional journals,
(5) publishers' catalogues, (6) each test itself together with its accompanying
manual, (7) journal articles reviewing a broad field of testing,
and (8) abstract and index series. These will be considered in
turn, the most useful items will be identified, and the information
to be obtained from each type of source will be indicated.




TEXTBOOKS IN SPECIAL AREAS 


There are a number of textbooks in more specialized areas of test- 
ing. When the scope is limited to include only elementary-school 
tests, tests for diagnosis of individual maladjustment, or tests for 
vocational placement, it becomes possible to cover the field in more 
detail. A book dealing with tests of a particular type provides a good 
general introduction to the materials of the field. Such a textbook 
usually acquaints the reader with a representative selection of estab- 
lished tests in the area—those which the author considers worthy of 
mention. In addition, some evaluation of each test is usually given, 
indicating the purposes for which it may well be used, and what the 
writer considers to be its strengths, weaknesses, and distinctive char- 
acteristics. The book will usually also contain some discussion of 
the problems of testing in the field it covers, apart from discussion of 
specific tests. 

It is not possible to consider all the textbooks that might prove 
useful to some reader. However, a number of them have been listed 
below with brief annotations. The titles have been chosen in terms 
of their recency and the quality of their treatment. In addition, an 
attempt has been made to get textbooks that represent a wide range 
of specialized interests. The annotations are designed to bring out 
the distinctive quality of each book. 

Anastasi, Anne, Psychological testing, New York, The Macmillan Company,
1954. Discusses general principles of psychological testing, individual and
group intelligence tests, special aptitude tests, and personality measures.

Arny, Clara Brown, Evaluation in home economics, New York, Appleton-
Century-Crofts, Inc., 1953. Although the examples given in the first half
are related to home economics, the excellent discussion of purposes and
methods of evaluating student progress is applicable to any class. Commercially
published standardized tests, check lists, and rating scales are described and
uses indicated in an appendix.

Bennett, George K., and Ruth M. Cruikshank, A summary of clerical tests,
New York, Psychological Corporation, 1949. Discusses the general problems
of selecting personnel for the clerical occupations. Provides information about
tests then available for this purpose.

Bennett, George K., and Ruth M. Cruikshank, A summary of manual and
mechanical ability tests, New York, Psychological Corporation, 1942. Provides
a listing of manual and mechanical ability tests available at the time, with a
full annotation providing relevant information on each.

Blair, Glenn M., Diagnostic and remedial teaching in secondary schools, New
York, The Macmillan Company, 1946. Lists, and in some cases describes,
tests that would be useful in connection with remedial programs in reading,
spelling, handwriting, and the fundamentals of English in the secondary
school.




Clarke, Harrison H., Application of measurement to health and physical
education, New York, Prentice-Hall, Inc., 1950. Describes a variety of
performance tests for physical fitness and skills, paper-and-pencil tests for
knowledge of sports techniques and health education, and rating scale
techniques. Emphasizes use of tests and planning an efficient program for
evaluating students in physical education.

Ferguson, Leonard W., Personality measurement, New York, McGraw-Hill
Book Company, Inc., 1952. Discusses methods used in personality measurement,
describing representative tests and devices used and evaluating the
adequacy of different techniques.

Froehlich, Clifford P., and A. L. Benson, Guidance testing, Chicago, Science
Research Associates, 1948. Describes and evaluates tests of intelligence,
achievement, interest, personality, and special aptitudes likely to be useful in
a high-school program of educational and vocational guidance.

Greene, H. A., A. N. Jorgensen, and J. R. Gerberich, Measurement and
evaluation in the elementary school, New York, Longmans, Green & Company,
1953. Selected chapters provide descriptions of tests suitable for use in the
elementary school.

Greene, H. A., A. N. Jorgensen, and J. R. Gerberich, Measurement and
evaluation in the secondary school, New York, Longmans, Green & Company,
1954. Selected chapters provide descriptions of tests suitable for use in the
secondary school.

Hardaway, Mathilde, and Thomas Maier, Tests and measurements in business
education, 2nd ed., Cincinnati, South-Western Publishing Company, 1952.
Provides lists of achievement and prognostic tests available in business
education.

Hayes, Samuel P., Vocational aptitude tests for the blind, Perkins Institution
and Massachusetts School for the Blind, Watertown, Mass., 1946. Describes
scholastic aptitude, mechanical aptitude, and other tests that have been
adapted for the blind, giving references to studies in which these tests have
been used.

Hildreth, Gertrude, Learning the three R's, 2nd ed., Minneapolis, Educational
Publishers, 1947. Lists and discusses tests for surveying and diagnosing
the achievement in reading, spelling, writing, and arithmetic, primarily in
the elementary school.

Lawshe, Charles H., Jr., Principles of personnel testing, New York,
McGraw-Hill Book Company, Inc., 1948. Describes tests useful in a personnel
selection program and discusses the manner of their use.

Super, Donald E., Appraising vocational fitness, New York, Harper and
Brothers, 1949. Reports on selected tests in a wide variety of fields that may
be used in educational or vocational guidance.


One limitation of textbooks, such as those just annotated, becomes
apparent from an examination of the publication dates. At the
time that these were selected (1954), each was judged to be the most
recent good book in its field and yet some were already eight years
old. When one adds to this the time that has elapsed in the preparation
and printing of the book, it is easy to see that a textbook cannot
be relied upon for current materials. The typical textbook gives
information about well-established and accepted tests, but recently
published devices or techniques that are still in the experimental
stages are not likely to be represented. There is a lag of several
years between production of a device and the reporting of it in
textbook materials.

Another feature of textbooks, which may be in some cases an ad- 
vantage and in others a disadvantage, is that they are selective. 
They must be. The author cannot discuss everything, so he must 
pick the items he wishes to present. He selects for discussion the 
tests which he considers valuable. In so far as his judgment is sound, 
he does a real service to the novice in the field, who is thus led di- 
rectly to the more important and valuable material. However, this 
means that the reader cannot expect to use a textbook as a source 
to lead him to all the tests in an area and permit him to compare 
them. For a full listing of the tests of any particular type he will
have to look elsewhere.


COMPREHENSIVE BIBLIOGRAPHIES OF TESTS 


There are available in the broad area of testing two comprehensive 
bibliographies with which every serious user and student of tests 
should be familiar. These are particularly useful for anyone who 
wishes to build up a complete list of the tests of a particular type or 
in a particular area, since they undertook to list all the tests which
had been produced up to their date of publication.

The bibliography that will probably most often be useful is the 
one prepared by Hildreth. Together with its supplement, this includes
5294 entries, and covers tests produced up to about 1945.
The listing is by topics. An attempt is made to list all tests and to 
indicate where each is published or a source where it is described. 
No further information is given about the tests, and no evaluation 
of them is attempted. The bibliography is all-inclusive, and much is 
included with the currently valuable material that is at most only of 
historical interest. This reference is, then, a useful aid in finding 
nearly all of the tests in an area up to 1945, but beyond locating a 
particular test for the reader it provides no help. 

Wang's bibliography was published in 1939 and 1940 and has
not been revised since. It is, therefore, quite out of date. However,
it does include a fairly extensive annotation on each test that is
listed. The annotation typically gives author and publisher, cost of
the test (at that time), time for administration, age or grade groups
for which designed, a description of test content, and some information
with regard to reliability and other statistical attributes of the
test. It has some value, therefore, in providing a rather complete
listing of the older tests together with information about each.


THE MENTAL MEASUREMENTS YEARBOOKS 


Probably the most useful single reference source for the person 
needing to make choices and plan programs in the field of testing is 
the series of Mental Measurements Yearbooks prepared by Buros.
Four Yearbooks have now been published, and they were preceded
by several more modest volumes of the same type. The Yearbooks
undertake to provide a listing and a frank and critical review of
each new standardized test that is published.

A large panel of reviewers has cooperated in the preparation of 
these volumes, each reviewer evaluating two or three tests in an 
area in which he is presumed to be competent. The tests of more 
general interest are appraised by two and sometimes even more
reviewers. The reviews are fairly full, pointing out strengths and
weaknesses of the test, comparing it with others in the field, and indicating
the purposes for which the reviewer considers it useful.

In addition to reviews of a test, the Yearbooks also include the factual
items about the test that a potential user is likely to need, such
items as author, publisher, publication date, cost, time to administer,
grades for which suitable, number of forms available, and the like.
Finally, for each test the Yearbooks give a bibliography of books and
articles that have appeared dealing with that particular test. These 
bibliographies are quite extensive, amounting in the case of one test 
to over a thousand titles. 

The Yearbooks have two other features that add to their value to 
the test user. One is a section on books and monographs related to 
measurement problems. This section undertakes to list all the books 
on measurement for the period covered and in addition gives excerpts 
from the reviews of these books that have appeared in psychological 
and educational journals. The bibliography and reviews provide a 
guide to, and evaluation of, publications in the field. 

Also valuable is a very complete index and directory section. This
includes (1) a directory and index of the publishers of the tests and
of the books on measurement reviewed in the volume, (2) a directory
and index of the periodicals that have included reviews of tests or
books on testing, (3) an index of titles of books and tests, (4) an
index of names occurring in any connection, and (5) a classified index
of tests organized by content or type. These indices make it possible
to locate any test or type of test, to locate the complete original of
any excerpted test review, and to get in touch with the publisher of
any test.

When a question arises about a test or a type of test, the Mental 
Measurements Yearbooks are the volumes for which one reaches al- 
most automatically. They are a “must” for any individual or any 
office that must answer frequent questions about tests or testing. 

The Yearbooks are not too convenient to use if one wishes to cover 
early as well as current tests in a particular area. At the present 
writing, there are four of them, published in 1938, 1941, 1949, and 
1953. (The large gap between the second and third was caused, of 
course, by the intrusion of World War II.) To cover the tests in 
any field, the reader must search all four volumes and may in fact 
need to go back to antecedent publications.

A new test will ordinarily be reviewed in the first Yearbook that 
came out after it was published, or reviews may appear in more 
than one volume. However, space limitations did not permit review 
in the 1938 Yearbook of all the older tests that were thought to merit 
review, and reviews of some of these have been included in the later 
volumes. Even the set of volumes taken together does not under- 
take to be exhaustive in its coverage of tests of a given type. How- 
ever, if he brings together the material in the complete series, the 
reader will probably find an appraisal of any test that he is likely 
to consider using, published up to the time that planning for the last 
Yearbook was completed. The first two Yearbooks cover tests up to 
about 1939; the third covers the period from 1940 through 1947; the 
fourth deals with material from 1948 through 1951. 


JOURNAL TEST REVIEWS 


We still face the problem of getting information on the latest tests 
and testing developments. One source for reviews of important new 
tests, particularly those likely to prove useful in individual consult- 
ing and clinical work, is the Journal of Consulting Psychology. This 
journal has a regular test- and book-review section, and in each issue
one finds reviews by the editorial staff of one or more new tests. The 
copies of this journal appearing since the closing date of the latest 
Mental Measurements Yearbook supply the largest amount of critical 
material on recent tests. A listing of recent tests will be found
approximately once a year in the journal Educational and Psychological
Measurement.




TEST PUBLISHERS 


The most up-to-date information on what tests are available is
probably to be obtained from the test publishers themselves, either
through correspondence or through their catalogues. There are many
publishers, too many to list here, so that gathering information from
all of them would be quite an undertaking. However, the number
who publish extensively in the testing field is a good deal more 
limited. A number of the most important publishers are listed in 
Appendix IV together with their addresses and some indication of 
the types of material and the services they supply. 

The limitations of a test publisher as an entirely unbiased source 
of information on the values and limitations of his own publications 
are, of course, obvious. Reversing Mark Antony, we may say he 
comes to praise his tests, not to bury them. However, as a source
of information about, rather than evaluation of, his tests, he can be
very helpful. In Chapter 6 we have considered how the potential
user may go about appraising a new test for himself in the light of
the information he can get from the test producer and from other 
sources. 


TEST AND MANUAL 


The individual who is seriously considering using a particular test
will certainly need to examine the test itself and the manual the publisher
has prepared to go with it. Each publisher's catalogue will indicate
the price for which a specimen set of each test may be obtained.
The specimen set contains a copy of the test itself, the instructions
for administering and scoring, and part or all of the supplementary
materials available to the user to help in interpreting the
test.

The amount of supplementary materials included in a specimen 
set varies from one publisher to another. The potential user can 
legitimately expect the publisher to include materials in a specimen 
set that will provide all the information he needs in order to arrive 
at a decision as to the suitability of the test for his purposes. He 
should be skeptical of any test for which the information supplied 
him is incomplete. The individual who wishes to examine a number
of different tests without buying specimen sets of each may be able
to find a test file in the library or the guidance department of his 
local university. 


To obtain specimen sets of tests, the applicant must ordinarily
present some sort of credentials. A letter on the official letterhead
of his school or institution will often suffice. A note from the university
where he is studying may serve the function. The limitations
that publishers place upon the distribution of their materials
depend upon the nature of the materials. They will often refuse to 
distribute tests that require special skills to administer and interpret 
unless the applicant can give evidence that he has the training and 
skills to qualify him to use the materials. 

A detailed examination of the test itself will provide the potential
user with a basis for judging how well the content of the test and
the form of test exercises correspond to the objectives and functions
he wishes to measure. The accompanying material, which we have
collectively called the test manual, is a very important part of any 
test. It varies enormously in quality and comprehensiveness from 
one test to another. In some of the better current tests, this collateral 
material becomes almost a book. It provides a great variety of im- 
portant information to help in using and interpreting a test. We 
have indicated in Chapter 6 (pp. 144-146) the types of information 
a test user has a right to expect to find in the test manual. A manual 
that provides all this information becomes a very important source 
for information about the test. 

Manuals differ greatly not only in comprehensiveness but also in 
impartiality and integrity. Probably no test manual is entirely free 
of a promotional element. However, sometimes the manual becomes
to a very large extent a promotional device focused on increasing the 
sales of the test. The potential user must always be aware of this 
aspect of the manual and must endeavor to discount appropriately 
claims made for the test. There often appears to be an inverse rela- 
tionship between the grandeur of the claims that are made and the 
evidence on which they are based. The reader will do well to keep 
his attention focused on the evidence presented in the manual, to
view claims in the light of this evidence, and to be extremely suspicious
of the test whose manual makes sweeping claims but presents
very little data.


JOURNAL REVIEW ARTICLES 


It is sometimes useful to refer to summary articles covering recent 
developments in tests and testing. The most regular of these in re- 
cent years has been the triennial summary in the Review of Educa- 
tional Research. This journal undertakes to summarize research in a 
number of different sectors of education. Its publication schedule is 
arranged so that a given area is treated every 3 years. Material on
tests and measurements was reviewed in the February, 1953, issue,
which was devoted to educational and psychological testing. Similar
reviews appeared in 1950, 1947, and every third year back to 1932.
Because of the volume of material to be covered, these reviews are
very condensed, but they do introduce the reader to new tests and
testing research and provide him with a bibliography of original
references to which he can go for a fuller report on any topic in which
he is interested.

A recently initiated review enterprise, which may prove useful, is
the Annual Review of Psychology. The first volume of this publication
appeared in 1950, and a volume has appeared each year since
then. The publication is designed to provide, with a minimum of
delay, reviews of significant psychological developments. Material 
relating to measurement problems may be found in chapters dealing 
with individual differences and with diagnosis in counseling and in 
the clinic. 

The field of psychological tests used to be reviewed periodically in 
the Psychological Bulletin. Reviews of interest dealing with special 
measurement problems will still occasionally be found in that jour- 
nal, though regularly repeated reviews in the field are not brought 
out at the present. 


ABSTRACTS AND INDICES 


Two final sources that must be brought to the attention of the 
serious student are the Psychological Abstracts and the Education 
Index. These are basic bibliographic sources in the fields of psy- 
chology and education respectively. Each undertakes to provide a 
complete listing of current publications in its respective field. The 
field for the Psychological Abstracts is rather more narrowly defined, 
being restricted to scientific and technical publications in psychology.
Each publication is represented not merely by title but also by an 
abstract indicating the nature of the report and the major findings. 
An annual subject index and author index aid in locating desired 
material. The Psychological Abstracts provides a monthly listing of 
new tests. This appears in the “General” section at the beginning 
of each issue under the heading New Tests. The Abstracts also covers 
the literature of research using tests and of findings with respect to 
them. 

The Education Index covers a considerably wider range of material,
since it deals with the whole broad area of education and includes
popular and professional materials as well as those of a more technical
and scientific nature. It gives references only, providing no
information about the nature and content of the item. Material is 
topically organized, and the user who looks under such topics as 
ability tests, educational measurement, mental tests, or personality 
tests will find most of the material relating to measurement in 
education. 

The joint use of the Psychological Abstracts and the Education
Index, supplemented by the other sources discussed previously,
should enable the student who wishes to dig to the roots of a measurement
problem to locate the bulk of the work that has been done
on that problem. A useful supplementary source for the earlier research
literature of the topic is South's classified bibliography,
which was published in 1937 and gives access to much of the research
literature prior to that date.


SUMMARY STATEMENT 


At the beginning of this chapter a number of questions were suggested
to which a test user might wish answers. The important
sources of information about tests and testing have now been discussed.
By way of summary, we may try to relate the sources to
the questions. An attempt has been made to do this in Fig. 8.1.
At the top of this chart are listed various questions one might raise
about a test, type of test, or testing problem. On the side are listed
the various types of source material referred to in this chapter. In
each cell is a symbol to represent the extent to which the source
should help in answering the question. The symbol ** is used to
designate one of the sources that would probably be most helpful
and to which one would turn first. Sources marked * are ones that
would also be expected to contribute to the needed answer. Sources
marked ? are ones that might perhaps provide some useful information.
Where there is no entry at all, the source is not likely to be
helpful in that connection. A critical study of this table, with
analysis of the reasons for the various entries, should leave the reader
well prepared to go out and get for himself the information he needs
in order to select a test or as background for a specific testing problem.


[Fig. 8.1 is a chart. Column headings, under "To Find Out, in Any Field": What tests there are; What new tests there are; What test X is like; What specialists think of test X; What facts we have about test X; What research has been done on or with test X or testing problem Y. Row headings: ... in special areas of measurement; Hildreth's bibliography; Wang's bibliography; Mental Measurements Yearbooks; Reviews in Journal of Consulting Psychology; New tests section in Educational and Psychological Measurement; Publishers' catalogues; Test blank and manual; Articles in Review of Educational Research; South's classified bibliography; Psychological Abstracts; Education Index.]

Key: ** Most helpful. * Somewhat helpful. ? Possibly helpful.

Fig. 8.1. Appraisal of sources of information about tests and testing.


REFERENCES 


1. Buros, O. K., Educational, psychological, and personality tests of 1933, 1934, and 1935, Rutgers Univ. Bull., Vol. 13, No. 1, Studies in Education, No. 9, New Brunswick, N. J., School of Educ., Rutgers University, 1936.

2. Buros, O. K., Educational, psychological, and personality tests of 1936, Rutgers Univ. Bull., Vol. 14, No. 2A, Studies in Education, No. 11, New Brunswick, N. J., School of Educ., Rutgers University, 1937.

3. Buros, O. K., The 1938 mental measurements yearbook, New Brunswick, N. J., Rutgers University Press, 1938.

4. Buros, O. K., The 1940 mental measurements yearbook, Highland Park, N. J., The Mental Measurements Yearbook, 1941.

5. Buros, O. K., The third mental measurements yearbook, New Brunswick, N. J., Rutgers University Press, 1949.

6. Buros, O. K., The fourth mental measurements yearbook, Highland Park, N. J., Gryphon Press, 1953.

7. Hildreth, Gertrude H., A bibliography of mental tests and rating scales, New York, Psychological Corp., 1939.

8. Hildreth, Gertrude H., A bibliography of mental tests and rating scales, 1945 supplement, New York, Psychological Corp., 1946.

9. South, E. B., An index of periodical literature on testing, New York, Psychological Corp., 1937.

10. Wang, C. K. A., An annotated bibliography of mental tests and scales, Peiping, China, Catholic University Press, 1939, 1940, 2 vols.


QUESTIONS FOR DISCUSSION 

1. Using the sources indicated in the text, prepare as complete a list as you can of standardized tests for a specific grade and purpose (i.e., tests in first-year Spanish, reading readiness tests, tests in American history for the twelfth grade, etc.).

2. Using the Fourth Mental Measurements Yearbook, find out what reviewers think of a particular test that you are interested in.

3. Using the Fourth Mental Measurements Yearbook, find out what reviewers have to say about one of the following titles that interests you:

   Eysenck, H. J., Dimensions of Personality
   Strang, Ruth, Counseling Techniques in College and Secondary School, Revised Edition
   Lawshe, C. H., Jr., Principles of Personnel Testing
   Goodenough, Florence L., Mental Testing

4. To what sources would you go to try to answer each of the following questions? To which would you go first? What would you expect to get from each?

   a. What test should I use to study the progress of two class groups in beginning H...?
   b. What kinds of norms are available for the Stanford Achievement Tests?
   c. Is the Rorschach Test of any value as a predictor of academic success in college?
   d. Has a new revision of the Wechsler Intelligence Scale been published yet?
   e. What intelligence tests have been developed for use with the blind?
   f. What are the significant differences between the Metropolitan Achievement Tests and the California Achievement Tests?
   g. How much does the Otis Quick-Scoring Intelligence Test, Beta, cost?
   h. What do testing people think of the Brainard Occupational Preference Inventory?


Chapter 9 


Standardized Tests of 
Intelligence or Scholastic 
Aptitude 


ACHIEVEMENT AND APTITUDE 


Ability tests are designed to appraise what an individual can do under favorable conditions when he is trying to do his best. Any ability test measures performance at the time of testing. From this performance we may hope to make one or more of a variety of different inferences. We may want to infer how effective a program of school instruction has been in teaching new knowledges or skills, i.e., how much progress the pupils have made in some kind of achievement. We may want to infer how well each individual will do in learning some new task, i.e., a prognosis of future achievement. We may want to make inferences about the organization or structure of human abilities, i.e., what goes with what. We may hope to unravel the causal factors in individual abilities or disabilities, i.e., why the individual fails or succeeds with a particular task. All these are different sorts of inferences. The basic evidence in every case is performance on a set of test tasks.


Performances are tied with varying degrees of closeness to specific, organized instruction. At one end of the scale are those knowledges and skills that are the direct outcome of organized teaching, usually in schools but sometimes on the job. To decipher the meaning of "Arma virumque cano" or of [a line of Gregg shorthand] are accomplishments that will be developed almost exclusively in a high-school course in Latin, on the one hand, or Gregg shorthand, on the other. Even the ablest individual with a wealth of general life experiences is unlikely to acquire abilities such as these unless they have been specifically taught. We will frequently want to measure the extent to which abilities such as these, dependent directly upon formal instruction, have been acquired. Tests thus tied to instruction and concerned with evaluation of past progress are spoken of as achievement tests or proficiency tests. We shall consider them in some detail in Chapter 11.

At the other end of the scale are abilities that are developed through the general experiences of life, quite apart from any formal instruction. Consider the two pictures in Fig. 9.1. Suppose we were to ask a child, concerning each of them: What is wrong with this picture? What is silly about it? As we went up the age range, we would find more and more children who could give us a satisfactory answer. But probably no child would have been specifically taught in school that shadows extend away from the sun or that in a wind flags and smoke will be blowing in the same direction. The background to apprehend the absurdity of these situations and the maturity to isolate the critical elements in the pictures come from the general experiences of growing up in our society.

Fig. 9.1A and B. Picture-absurdity-test items.

It should be emphasized that any performance depends in some degree upon experience. A child from a culture that had provided no experience with books and pictures would be less likely to succeed with the tasks of Fig. 9.1 because he had never learned to interpret a picture as corresponding to real things in a real world. A child who had had no experience with chimneys or with flags would be severely handicapped on picture B since he would not be able to interpret the picture or know how these things should behave. The two absurdities items assume (1) a general familiarity with pictures and the representation of things by pictures and (2) experience with trees, shadows, houses, flags, smoke, and wind. The normal child in a normal American environment will have had these experiences in abundance. For him, therefore, the test provides a measure of perception, analysis, and understanding of his environment. Differences between individuals in performance on these tasks may then reasonably be expected to reflect fairly basic differences in certain aspects of intellectual ability.

The examples we have given have illustrated two points quite far apart on the scale ranging from "directly taught" to "acquired entirely from general life experience." Many abilities fall at intermediate points along this scale. The meanings of words, for example, are taught in school in connection with almost every segment of the school program. But a very large part of our stock of word meanings is picked up in the reading and listening done out of school as an incidental by-product of just living in our society. Again, reading is usually first learned in school, but a large part of the growth in fluency of reading and depth of understanding of printed matter comes from out-of-school reading and from the general acquiring of experience and maturity as a part of growing up. There is no clear boundary line marking off the ability that is a school achievement from the one that is not.

Psychologists and educators are interested in measuring the underlying aptitudes of human beings. The interest is sometimes in using these aptitude measures to predict later achievements. It is sometimes in studying the aptitudes for their own sake. But the concept of aptitude is a tricky one. Aptitude implies some natural or innate capacity for a particular type of performance: scholastic aptitude, mechanical aptitude, or artistic aptitude. But all we can observe is performance on a set of tasks. As stated above, this performance inevitably depends in some measure upon the experiences that the individual has had. If we want to get at basic individual differences in capacity to do a certain type of task, our only hope is to seek for test items based on experiences so common and general in our culture that almost every person will have had the requisite experiences. We must build upon the common core of experience available to all. This is what aptitude tests aspire to do. They try to base their items upon experiences, mostly out of school but overlapping to some extent those provided in school, that are uniformly provided for individuals growing up in our society. They use these present abilities, based inevitably on a variety of past learnings, as indicators of what the individual can learn to do in the future.

The difference between aptitude measures and achievement measures 
is, then, one of degree and emphasis. Any test of ability is to some 
extent an aptitude test and to some extent an achievement test. 
The difference between the two designations is perhaps more in the 
type of inference that we want to make than in the specific content 
or in the “innateness” of the measure. A test is to be thought of as 
an achievement test when we wish to draw conclusions about past 
progress and as an aptitude test when we wish to estimate future 
potentialities. The remainder of the present chapter will be devoted 
to tests of general intellectual ability, or scholastic aptitude. Chap- 
ter 10 will be concerned with other types of special abilities, and 
Chapter 11 will be devoted to standardized tests of educational achievement.


TASKS USED TO MEASURE ABSTRACT INTELLIGENCE 


Much of the research and development of aptitude measures has been devoted to devising and studying tests of "general intelligence," familiarly known as "I.Q. tests." General intelligence, in this context, has typically meant abstract intelligence: the ability to see relations in, make generalizations from, and relate and organize ideas represented in symbolic form. What general intelligence has meant to those who have tried to test it can be seen from the types of tasks they have used. Examples of a number of the common types of tasks are given below. The keyed answers are underlined.


VOCABULARY

A word meaning nearly the same as robust is

A. cheerful.   B. strong.   C. fat.   D. small.   E. wealthy.

VERBAL ANALOGIES

Branch is to tree as brook is to

A. water.   B. root.   C. bank.   D. river.   E. babble.

SENTENCE COMPLETION

The sun rises in the ______ and sets in the west.

A. summer.   B. morning.   C. east.   D. end.   E. sky.


ARITHMETIC REASONING

A boy bought candy bars at 90 cents for a box of 24 and sold them at 5 cents each. How much did he make on each bar?

A. 30 cents.   B. 3 3/4 cents.   C. 1 1/4 cents.   D. 45 cents.   E. None of these.

NUMBER SERIES

What number should come next to continue the series 1 2 4 7 11?

A. 14.   B. 15.   C. 16.   D. 18.   E. 22.

FIGURE ANALOGIES

Fig. 9.1C.

CLASSIFICATION

Look at the three words on the left. Which word on the right belongs with these three?

Doctor. Lawyer. Engineer.        Farmer. Architect. Mechanic. Salesman. Laborer.

"MULTIMENTAL"

Which one of the figures does not belong with the other four?

Fig. 9.1D.

PICTURE ARRANGEMENT

The pictures below tell a story. Which picture comes first in the story?

Fig. 9.1E.


COMPREHENSION (COMMON SENSE)

What is the thing to do if you bump into someone and hurt him?

SIMILARITIES

In what way are wool and cotton alike?

INFORMATION

What month in the year has the fewest days?

DIGIT SPAN

"I will say some numbers. Listen carefully, and when I am through repeat exactly what I said. Listen: [a series of digits is read]. Now repeat what I said."

DIGIT SYMBOL SUBSTITUTION

This is a code test. Each figure stands for a particular number. You are to put the right numbers in the boxes as fast as you can.

Fig. 9.1F.


OBJECT ASSEMBLY

These pieces, if put together correctly, will make a boy. Go ahead and put them together.

Fig. 9.1G. Object Assembly Test item from Wechsler Intelligence Scale for Children. (Reproduced by permission of the Psychological Corporation.)
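The keyed answers to the two numerical sample items above (arithmetic reasoning and number series) can be checked mechanically. The following is only an illustrative sketch; the function names are hypothetical, and the series rule (successive differences growing by a constant step) is an assumption that happens to fit this particular item.

```python
from fractions import Fraction

def profit_per_bar(box_cost_cents, bars_per_box, price_cents):
    """Profit on one bar, in cents, kept as an exact fraction."""
    cost_per_bar = Fraction(box_cost_cents, bars_per_box)
    return Fraction(price_cents) - cost_per_bar

def next_in_series(series):
    """Continue a series whose successive differences grow by a constant step."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    step = diffs[-1] - diffs[-2]
    return series[-1] + diffs[-1] + step

print(profit_per_bar(90, 24, 5))          # 5/4, i.e., 1 1/4 cents (choice C)
print(next_in_series([1, 2, 4, 7, 11]))   # 16 (choice C)
```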


GROUP INTELLIGENCE TESTS 


Most of the intelligence testing carried on in this country is done with group tests. These are paper-and-pencil tests much like the objective type of school examination. They usually consist of 75 to 100 multiple-choice items of the types illustrated in the previous section. Ordinarily, the examinee must read the problem to himself, must work ahead and do the tasks one after another, and must do as many as he can within a fixed time limit. However, some group tests call for oral instructions from the examiner, and some are paced by the rate at which the examiner presents the test tasks.


Some group intelligence tests (e.g., California, Kuhlmann-Anderson, Lorge-Thorndike, Pintner) are made up of several separately timed subtests, in each of which all the items follow the same pattern; i.e., all are vocabulary items or all are number-series items. Others (e.g., Henmon-Nelson, Otis) have the different types of items mixed in together, a vocabulary item being followed by a number-series item, that by a figure-analogies item, etc. The cycle of different types of items is repeated, the items gradually becoming more difficult. This type of test is called a "spiral omnibus" test because of the cyclical pattern.
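The spiral omnibus arrangement just described can be pictured as an ordering rule: cycle through the item types, taking the next-easiest item of each type on every pass. The sketch below is a hypothetical illustration of that ordering only; the item labels and difficulty values are invented, and no published test is claimed to be assembled this way.

```python
def spiral_omnibus(items_by_type):
    """Interleave item types in repeating cycles, easiest items first.

    items_by_type maps an item type to a list of (difficulty, label) pairs.
    """
    # Sort each type's items from easiest to hardest.
    pools = {t: sorted(items) for t, items in items_by_type.items()}
    order = []
    cycle = 0
    # Each pass of the loop is one cycle through all the item types.
    while any(cycle < len(pool) for pool in pools.values()):
        for item_type, pool in pools.items():
            if cycle < len(pool):
                order.append((item_type, pool[cycle][1]))
        cycle += 1
    return order

# Hypothetical items: (difficulty, label).
items = {
    "vocabulary": [(2, "V2"), (1, "V1")],
    "number series": [(1, "N1"), (2, "N2")],
    "figure analogies": [(1, "F1"), (2, "F2")],
}
for item_type, label in spiral_omnibus(items):
    print(item_type, label)
```

A subtest-organized test, by contrast, would simply present each sorted pool in turn under its own time limit.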
The typical group test is designed to cover a range of three or four school grades, i.e., 4 to 6, 7 to 9, 10 to 13. Tests for elementary-school children usually call for responses marked in the test booklet itself, but many of the tests for older groups use separate answer sheets that can be machine scored.

There are a number of different series of group tests on the market that are quite satisfactory to use. The number is too great to permit discussion of each one here. Several are listed, together with annotations describing and providing some evaluation of each, in Appendix III.

In the remainder of this chapter we will first describe the two individual tests, i.e., tests given to one examinee at a time in a face-to-face setting, that are currently most widely used in the United States. These are the Revised Stanford-Binet Tests of Intelligence and the Wechsler-Bellevue Intelligence Scales. Next, we will discuss some of the special types of intelligence measures: tests avoiding reading and language, tests for the very young, tests designed to be free from cultural bias. Then we will compare group and individual tests, considering the advantages of each. Finally, the remainder of the chapter will be concerned with evaluation, interpretation, and use of intelligence test results.


THE REVISED STANFORD-BINET TESTS OF INTELLIGENCE

The individual test that has had the widest use with school-age children is the Stanford-Binet, developed by Lewis M. Terman. A revised version of the test was published in 1937 by Terman and Merrill, and this is the one that is currently used. The Revised Stanford-Binet provides a set of tests for each of 20 levels of ability. It starts with tests suitable for the average 2-year-old and ends with four levels suitable for differentiating the abilities of average and superior adults. To illustrate the content of the test, we have picked four levels at different points on the scale and listed the tests of each level with brief descriptions.


TWO-AND-A-HALF-YEAR LEVEL

1. Identifying Objects by Use. (Card with 6 small objects attached.)
   "Show me the one that we drink out of," etc.
   Three out of 6 for credit at this level.*
2. Identifying Parts of Body. (Large paper doll.)
   "Show me the dolly's hair," etc.
   Four out of 4 for credit at this level.*
3. Naming Objects. (Five small objects.)
   "What is this?" (Chair, automobile, etc.)
   Four out of 5 for credit.
4. Picture Vocabulary. (Eighteen small cards with pictures of common objects.)
   "What's this? What do you call it?"
   Nine out of 18 for credit at this level.*
5. Repeating Two Digits.
   "Listen; say 2." "Now, say 4, 7," etc.
   One out of 3 for credit.
6. Three-Hole Form Board, Rotated. (Board with square, circle, and triangle cut out.)
   Blocks taken out, board rotated, child told, "Put them all back where they belong."
   One out of 2 tries for credit.


SIX-YEAR LEVEL

1. Vocabulary. (Graded list of 45 words.)
   "When I say a word, you tell me what it means. What is an orange?" etc.
   Five words correct to receive credit at this level. Words like tap, gown.
2. Copying a Bead Chain from Memory. (Wood kindergarten beads.)
   Examiner makes 7-bead chain, shows to child for 5 seconds, removes, and child copies.
3. Mutilated Pictures. (Five cards of objects with part missing.)
   "What is gone in this picture?" or "What part is gone?"
   Four out of 5 for credit.
4. Number Concepts. (Twelve 1-inch cubes.)
   "Give me 3 blocks. Put them here."
   Three out of 4 different numbers correct.
5. Pictorial Likenesses and Differences. (Six cards with sets of figures.)
   "Put your finger on the one that is not the same as the others."
   Five out of 6 for credit at this level.
6. Maze Tracing. (Mazes, with start and finish points marked.)
   "The little boy wants to go to school the shortest way without getting off the sidewalk. Show me the shortest way."
   Two right out of 3 for credit.

* Scored also at one or more other levels.


TWELVE-YEAR LEVEL

1. Vocabulary. (Same as 6-year level.)
   Fourteen words correct for credit at this level. Words like juggler and brunette.
2. Verbal Absurdities. (Five statements.)
   "Bill Jones' feet are so big that he has to pull his trousers on over his head. What is foolish about that?"
   Four out of 5 right for credit at this level.
3. Response to Pictures. (Picture of messenger boy with broken bicycle.)
   "Look at this picture and tell me all about it."
   Three essential facts must be mentioned for credit.
4. Repeating 5 Digits Reversed.
   "I am going to say some numbers, and I want you to say them backwards."
   One out of 3 correct for credit.
5. Abstract Words.
   "What do we mean by courage?"
   Two out of 4 for credit at this level.
6. Sentence Completion. (Four sentences with missing words.)
   "Write the missing word in each blank. Put just one word in each."
   Two out of 4 required for credit at this level.


SUPERIOR ADULT LEVEL II

1. Vocabulary. (Same as 6-year level.)
   Twenty-six words for credit at this level. Words like mosaic, flaunt.
2. Finding Reasons. (Two parts.)
   "Give three reasons why a man who commits a serious crime should be punished."
   Both parts right for credit.
3. Repeating 8 Digits. (Three series.)
   "Say them just the way I do. Listen carefully, and get them just right."
   One out of 3 trials for credit.
4. Proverbs. (Bird in hand; silk purse out of sow's ear.)
   "Here is a proverb, and you are supposed to tell what it means."
   Both correct for credit.
5. Reconciliation of Opposites. (Six parts.)
   "In what way are heavy and light alike?"
   Five out of 6 for credit.
6. Repeating Thought of Passage: Value of Life.
   "I am going to read a short paragraph. When I am through . . . repeat as much of it as you can."


The above examples illustrate the variety of material included in the test. Note that the specific tests vary from one level to another. The tests at the lower age levels are quite concrete, dealing with little objects and pictures. At the upper levels, they tend to be more abstract and quite heavily verbal. The various tests include tasks calling for display of past learnings, perception of relations, judgment, interpretation, sustained attention, immediate memory, and a variety of other cognitive processes.

The tasks were selected so as to be of appropriate difficulty for the average child of the age level to which they are assigned. A child is tested by starting at a level at which it is anticipated that he will pass all tasks. If he fails any, the examiner drops back to an easier level. If he passes them all, the examiner moves ahead level by level until the child fails all tasks at one age level. The child is credited with the basal age at which he passes all tasks plus a credit for tasks passed at more advanced levels. Each task passed at a given level credits the child with the same number of months of mental age. Thus, where there are 6 tests at each year age level, passing a single test gives a credit of 2 months of mental age. For example, child A

   Basal age (level at which he passes all tasks)                = 6 yrs.
   Credit for tasks passed at successively more advanced levels  = 6 mos., 2 mos., and 0

   Resulting in a mental age of 6 yrs., 8 mos.
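The crediting rule above amounts to simple addition, which the following sketch makes explicit. It is an illustration only: the function name is hypothetical, and the counts of 3 tests and 1 test passed at the higher levels are inferred from the 6-month and 2-month credits at 2 months per test, not stated in the text.

```python
def mental_age_months(basal_age_years, tests_passed_above_basal, months_per_test=2):
    """Mental age in months: basal age plus credit for tests passed above it.

    tests_passed_above_basal lists the number of tests passed at each
    successively higher level, ending with the level at which all are failed.
    months_per_test is 2 where there are 6 tests at each year level.
    """
    credit = sum(n * months_per_test for n in tests_passed_above_basal)
    return basal_age_years * 12 + credit

# Child A: basal age 6 years; 3 tests, 1 test, and 0 tests passed above it
# (counts inferred from the credits of 6 mos., 2 mos., and 0).
ma = mental_age_months(6, [3, 1, 0])
years, months = divmod(ma, 12)
print(ma, years, months)   # 80 months, i.e., 6 yrs., 8 mos.
```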


Level of achievement is expressed as a mental age, arrived at as indicated above. But this takes no account of the child's life age. In order to take account of life age and thus relate a child's performance to that of a reference group of contemporaries, mental age is divided by chronological age, the actual time lived.* Thus, if the child referred to in the example given above were 5 years and 10 months old, we would have

$$\frac{\text{Mental age}}{\text{Chronological age}} = \frac{80\ \text{months}}{70\ \text{months}} = 1.14$$

Multiplying by 100 to get rid of the decimal, we have 114 as the intelligence quotient (I.Q.) of this child. In a representative group of children, the Revised Stanford-Binet gives I.Q.'s with an average of 100 and a standard deviation of about 16½ points. The way I.Q.'s spread out is shown in Table 7.7 (p. 171). Thus, a child with an I.Q. of 130 would surpass about 95 per cent (95.5 per cent by Table 7.7) of children of his age; one with an I.Q. of 90 would surpass about 23 per cent (22.7 per cent by Table 7.7). We can estimate that the child whose I.Q. we computed above as 114 would surpass roughly 75 to 80 per cent of children of his age.

* Actual life age is used up to the age of 14 years. Special procedures are required above this age, since test intelligence does not continue to increase during adulthood.
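The quotient computed above can be written as a one-line function. This is a minimal sketch with a hypothetical function name; it covers only the simple ratio and does not attempt the special procedures the footnote mentions for ages above 14.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio I.Q.: 100 times mental age over chronological age, rounded."""
    return round(100 * mental_age_months / chronological_age_months)

# The child in the example: mental age 80 months, life age 5 yrs. 10 mos.
print(ratio_iq(80, 5 * 12 + 10))   # 114
```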


THE WECHSLER-BELLEVUE INTELLIGENCE SCALE 


The second major individual intelligence test is the Wechsler-Bellevue Intelligence Scale. This test was originally developed for adults, and the materials and tasks were chosen with an eye to their appropriateness for adults. The pattern of organization of the test differs from that of the Binet. Whereas the Binet, developed for children, is organized in successive age levels, the Wechsler is organized by subtests representing types of tasks. The subtests are the following:

   Verbal Subscale                  Performance Subscale

   1. General Information.           7. Picture Arrangement.
   2. General Comprehension.         8. Picture Completion.
   3. Arithmetical Reasoning.        9. Object Assembly.
   4. Digit Span.                   10. Block Design.
   5. Similarities.                 11. Digit-Symbol Substitution.
   6. Vocabulary.


Tasks like those in a number of the subtests will be found among the examples on pp. 206-208.

Each subtest of the Wechsler yields a separate score, which is then converted into a standard score for that subtest. The subtest standard scores are combined in three different groupings to yield total scores, and from these total scores three different types of I.Q.'s may be read from norm tables. The three I.Q.'s are (1) a verbal I.Q. from subtests 1-6, (2) a performance I.Q. from subtests 7-11, and (3) a total I.Q. from all the subtests put together. The separate verbal and performance I.Q.'s may have diagnostic significance in the case of certain individuals with verbal, academic, or cultural handicaps. The I.Q. on the Wechsler is frankly a standard score, set to make the mean of the normative sample 100 and the standard deviation 15.

As we have indicated, the original Wechsler-Bellevue Intelligence Scale was designed for adults. It was suitable for use with adolescents and with adults of all ages. Subsequently, however, the material has been extended downward to make a test for children. The same general pattern of subtests has been used, though with minor variations. In particular, the nature of the tasks in several of the subtests changes as one goes down to the easiest items. The Wechsler Intelligence Scale for Children is designed to be usable from age 5 to 15.
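The idea of an I.Q. that is "frankly a standard score" with mean 100 and standard deviation 15 can be sketched as a linear rescaling of a total score against a normative group. This is only an illustration of the principle: the Wechsler I.Q.'s are read from published norm tables, and the function name and the normative mean and standard deviation used below are hypothetical.

```python
def deviation_iq(total_score, norm_mean, norm_sd, iq_mean=100, iq_sd=15):
    """Rescale a total score so the normative group has mean 100 and S.D. 15."""
    z = (total_score - norm_mean) / norm_sd   # distance from the norm mean in S.D. units
    return round(iq_mean + iq_sd * z)

# Hypothetical normative sample: mean total score 100, standard deviation 25.
print(deviation_iq(total_score=120, norm_mean=100, norm_sd=25))   # 112
print(deviation_iq(total_score=100, norm_mean=100, norm_sd=25))   # 100
```

Unlike the ratio of mental age to chronological age, this kind of score needs no intervening mental age, which is the point of feature 3 in the list that follows.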




The features that distinguish the Wechsler-Bellevue from the Re- 
vised Stanford-Binet are: 


1. Original test items specifically designed for adults. 

2. Organization by subtests rather than by age levels. 

3. Determination of I.Q. directly from test score, without any 
intervening M.A. 

4. Provision for separate verbal and non-verbal I.Q.'s. 


All these features seem like sound adaptations in a test for adults. 
Most psychometricians would probably agree now in preferring the 
Wechsler-Bellevue as a measure for adolescents and adults, though its 
relation to academic success is perhaps not as clearly established as 
is the Binet's. (As a matter of fact, at these ages a printed group 
test would usually seem more appropriate for academic prediction.) 

The Wechsler Intelligence Scale for Children (WISC) cannot be used with children below 5 and is probably not too satisfactory below the age of about 7. For young children the Binet would be generally preferred. In the age range from 7 to 15, a decision between the two tests is not an easy one. The Binet is reported to be somewhat more difficult and time-consuming to give. The usual Binet procedure of carrying the examinee through to the point where he ends with a series of failures is judged to be a seriously upsetting matter for some emotionally tense children. The separate verbal and performance I.Q.'s of the WISC should be quite useful in some cases in understanding children whose verbal development is either very accelerated or retarded. It has diagnostic value for some children with special educational disabilities. However, the Binet is probably a somewhat more reliable measure. (No directly comparable data are available.) The test items entering into the Binet have had the benefit of trial in earlier forms, with opportunity to revise and select on the basis of that experience. The ultimate basis for choice will be the validity of the inferences that can be made from each in the situations in which they are actually used. The Wechsler Intelligence Scale for Children is too new for experience to provide the basis for a decision on its relative effectiveness in school and clinic, as compared with the Binet.


NON-LANGUAGE AND PERFORMANCE TESTS 


Most of the widely used intelligence tests depend to some degree upon language and include tasks presented in verbal terms. This is natural, since the bulk of our learning and thinking makes use of language. For the usual person and in relation to the usual type of academic learnings, aptitude for learning can be tested more efficiently by tasks that involve language than by those that do not. However, for some groups or situations this is not so. The most obvious example is that of groups who do not speak the language or speak it only slightly. When English is not the native tongue, results from a verbal test in English are in large measure meaningless. Children who have had little opportunity to attend school may suffer a special handicap on a test that relies upon materials close to school learnings. For groups of this sort, tests have been developed that do not require language. In some of these just the test tasks are non-language in character; in others the instructions can be given by pantomime and no language need be used at any point during the testing.

A recent group test that requires no language in solving the test problems, though the instructions are presented in words, is the Lorge-Thorndike Intelligence Test, Non-Verbal Series. Types of tasks that are included are figure analogies, figure classification, and number series. (See examples on p. 206.) A group test that dispenses with language in either instructions or test is the Pintner Non-Language Test, in which all instructions may be given by pantomime. The test includes the following types of tasks, which are illustrated in Fig. 9.2.


1. Figure dividing, indicating which line or lines will divide a figure up to give a specified set of parts.

2. Reverse drawings, indicating the line or lines needed to complete a mirrored drawing.

3. Pattern synthesis, indicating the figure that will result from superimposing two figures.

4. Movement sequence, selecting the figure that follows the movement sequence established by three figures in the stem of the item.

5. Manikin, selecting the manikin that is the same as the one in the stem, except rotated in some way.

6. Paper folding, selecting the diagram that shows how a paper folded and cut in a specified way will look when unfolded.


When used with ordinary school groups, a test such as the Pintner
Non-Language provides an appraisal of intelligence somewhat distinct
from that provided by a verbal measure. Thus, this test correlates
only about 0.65 with the Pintner General Ability Test, Language
Series, a test made up of verbal and arithmetical material. With
usual groups, the non-language test may be expected to be somewhat
less effective as a predictor of school achievement. The value of the
non-language test is for atypical individuals or groups, i.e., the
deaf, the foreign born, or the academically retarded.

Fig. 9.2. Sample items from Pintner Non-Language Intelligence Test.
(Copyright 1941, World Book Co., Yonkers, N. Y. Reproduced by
permission.)

Individual tests are also available that do not require the use of
language. We have already described the Wechsler-Bellevue
Intelligence Scale and referred to the Performance I.Q. provided by
this test. The Performance I.Q. is based upon five subtests that do
not require the subject to use language once he has been instructed
as to the nature of his task. A performance test that is widely used
with children, as a supplement to the Binet when a verbal handicap
is suspected, or for groups with which the Binet would not be
appropriate, is the Arthur Point Scale. We shall describe it in some
detail, since it is a good representative of individual performance
tests.




The Arthur Point Scale consists of two forms, which contain somewhat
different tests. Form I has nine subtests, as follows:

1. Knox Cubes: The examiner taps four cubes in a specified sequence,
and the subject must reproduce the sequence.

2. Seguin Form Board: Ten geometric figures are to be placed into the
corresponding holes in the board as rapidly as possible.

3. Two-Figure Form Board: Cut-up pieces are to be fitted into a square
and cross cut out of the board.

4. Casuist Form Board: Similar to the above, only four figures.

5. Manikin or Feature Profile (depending on level): Cut-up figure of
man or cut-up face is to be assembled.

6. Mare and Foal: Picture has cut-outs that are to be fitted into place.

7. Healy Picture Completion I: Picture has square cut-outs, and subject
must select the appropriate block to make the most meaningful picture.

8. Porteus Mazes: Simple pencil mazes are to be traced without
retracing or crossing a line.

9. Kohs Block Design: Designs are to be reproduced using colored
cubical blocks, like those in sets for children.


Form II of the test also uses the Knox Cube, Seguin Form Board,
Healy Picture Completion, and Porteus Mazes, presenting a different
form or a different set of tasks from Form I. In place of the other
tests, however, it substitutes the Arthur Stencil Design Test. In this
test, the subject is supplied with a set of colored cards and a set of
cut-outs of different designs and colors. The subject is shown a
design that can be produced by superimposing certain ones of the
cards provided to him. He must select the right cut-outs and
background and put them together in the right order to produce the
master design.

A point score is allowed the subject for his performance on each
subtest of the Arthur Scale. The score depends in some subtests
upon the speed with which the task was completed, in others upon
the correctness of the solution or the number of graded tasks solved.
The point credits for the subtests are summed to give a total point
score, and this is converted to a mental age equivalent. An I.Q. is
computed in the same manner as for the Binet, and I.Q.'s appear to
have about the same distribution as for the Revised Binet.

There have been a number of other attempts to evaluate intellectual
ability through performance tasks, ideally ones that would be
usable in different countries and different cultures. One of the most
widely known is the Goodenough Draw-a-Man Test, in which the
child is told simply, "Draw a man, the best man you can draw."
The performance is scored on completeness and maturity of
representation, not on esthetic qualities.
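
The last scoring step described above for the Arthur Scale, converting
a mental age to an I.Q. "in the same manner as for the Binet," can be
sketched as follows. This is only an illustration. It assumes the ratio
definition of the I.Q. (mental age divided by chronological age, times
100); the function name and the sample ages are hypothetical, and the
earlier step from point score to mental age, which depends on the
published norm tables, is not shown.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio I.Q. (assumed definition): 100 * mental age / chronological age."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child: chronological age 10 years (120 months),
# mental age equivalent 8 years (96 months).
print(ratio_iq(mental_age_months=96, chronological_age_months=120))  # 80.0
```

Working in months avoids fractional years. Any ceiling placed on
chronological age for older subjects is left out of this sketch.
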




The individual performance test must generally receive the same 
evaluation as the group non-language tests. For an English-speak- 
ing person with normal environmental opportunities and without 
specialized language or reading handicap, it represents a less efficient 
way of appraising mental development than the more widely used 
verbal test. However, as a way of checking on whether there is a 
specialized language handicap it represents a valuable supplemental 
tool. It makes it possible to check upon individuals who appear re- 
tarded on the verbal type of test to see whether the retardation is 
general or whether it is a localized deficiency in the language area. 
A performance test such as the Arthur Point Scale, which can be 
given with pantomime instructions, is also useful in testing deaf 
children, non-English-speaking children, and other types of special 
groups. 


INFANT AND PRESCHOOL TESTS 


The first intelligence tests were made for school-age children. 
However, it was not long before the theoretical interests of child 
psychologists and the practical needs of child-care and placement 
agencies stimulated the attempt to develop procedures for appraising 
intelligence in preschool children and even in infants. Any appraisal 
procedures with young children obviously had to be individually ad- 
ministered. Also, they had to be based upon behavior that was 
spontaneously exhibited by or could be elicited from children of the
age being studied. Infant tests, therefore, had to take on a very
different character than later appraisals. Arnold Gesell pioneered in
designing tests based on observation of the child's postural,
perceptual, manipulative, and social responses. Does he sit up? Stand
up? Walk? Will he turn to look at a light? Notice a face? Can
he pick up a block? A spoon? A little pellet? By what type of a
grasping motion? How does he react to strange adults? To another
infant?


Observations of large numbers of infants showed a typical de- 
velopmental sequence in the different aspects of the child's develop- 
ment. Performance B followed A, and was followed by C. Norms 
have been established representing the average age at which a partic- 
ular skill manifests itself. The child may be assigned a developmental 
age, based upon the tasks he can do. Retests after a short interval 
show the child to be fairly consistent in his level of performance. If 
he is advanced at one testing, he will tend to be advanced at the 
other. The developmental schedules provide a moderately reliable 
picture of the individual at that point in time. 




What significance does acceleration or retardation in development
during the first year or so of life have for predicting later
intelligence? The answer is well presented in Table 9.1, which shows the
correlation of infant tests given at the ages of 1 to 12 months with
intelligence tests at various later ages. The tests during the first
15 months were those of the California First-Year Mental Scale, those
from 18 months to 5 years were the California Pre-school Scale, and
those from 6 years on were the Stanford-Binet.


Table 9.1. Correlation of Intelligence Tests During First Year of Life with
Later Measures *
(Correlations based on pooling of successive tests)

                                   Age at Initial Test
Age at Later       1, 2, 3      4, 5, 6      7, 8, 9      10, 11, 12
Test               mos.         mos.         mos.         mos.

4, 5, 6 mos.        .57
7, 8, 9 mos.        .42          .72
10, 11, 12 mos.     .28          .52          .81
13, 14, 15 mos.     .10          .50          .67          .81
18, 21, 24 mos.    -.04          .23          .39          .60
27, 30, 36 mos.    -.09          .10          .22          .45
42, 48, 54 mos.    -.21         -.16          .02          [illegible]
5, 6, 7 yrs.       -.13         -.07          .02          .20
8, 9, 10 yrs.      -.03         -.06          .07          .19
11, 12, 13 yrs.     .02         -.08          .16          .30
14, 15, 16 yrs.    -.01         -.04          .01          .28
17, 18 yrs.         .05         -.01          .20          [illegible]

* Tests used were: 1-15 months, California First-Year Mental Scale; 18
months to 5 years, California Pre-school Scale; 6 years and older,
Stanford-Binet.

The picture seems quite clear. The infant tests give a fairly good
prediction of developmental status a few months later, but their
value as predictors drops rapidly as the interval increases. The
infant tests provide essentially no prediction of intellectual status at
school age. Whatever factors produce differences in rate of
development during the first year or so of life are entirely distinct
from those that determine intellectual level at school age. It seems,
then, that little practical significance can be attached to results
from infant developmental schedules. They describe an aspect of the
child which is temporary only, not lasting.

There have been a number of different tests prepared primarily
for use with preschool children. As a matter of fact, as we have
seen, the Revised Stanford-Binet has tests going down to the 2-year
level and may be considered a preschool test. It would compare
very favorably with the other tests available for this age level, though
it is somewhat more verbal than many of the others. A good many
of the preschool tests have tended to get away from the verbal
material that appears so heavily in group tests for older children and
also in the Stanford-Binet.

One test for preschool children that has received wide use is the
Merrill-Palmer Scale. This is most suitable for children from 2 to
4, though it can be used with children slightly older and slightly 
younger. The test is made up of 38 little subtests, of which only 4 
call for verbal response by the child. A number of the tasks call for 
gross motor coordination (standing on one foot) or finer eye-hand 
coordination (building block tower, cutting with scissors). Form 
and object perception and motor control combine in a number of 
form-boards in which cut-outs must be fitted into the appropriate 
hole. The tasks make use of a variety of materials interesting to 
the child, blocks, pictures, scissors, balls, etc., so that cooperation 
can usually be obtained, a real problem with children at these ages. 

The Merrill-Palmer Scale has fairly satisfactory reliability,
especially above about 30 months. Correlations with retests 6 months
later have been reported as follows for different age groups:


24 months 0.63 
30 months 0.76 
36 months 0.78 
42 months 0.80 


The correlation with school-age Binet is about 0.40 for a Merrill- 
Palmer test at age 2; about 0.45 to 0.50 for one at age 4. 

The Minnesota Preschool Scale is another example of a test designed
for preschool groups. The 26 tests in this scale tend to be
more like those of the Binet. Six tests taken at random from one
form of the Scale are described briefly. They are

Test 2: Pointing Out Objects in Pictures. Card with man, chair, apple,
house, and flower on it. Child is asked to point to each in turn.

Test 5: Imitative Drawing. Experimenter makes vertical stroke; then
a cross. Child is asked to imitate each in turn.

Test 11: Imitation. A set of 4 cubes, on which experimenter taps in
specified sequence. Child instructed to imitate the sequence of taps.

Test 14: Colors. Cards colored red, blue, pink, white, and brown.
Child is asked to name the color.

Test 20: Paper Folding. Examiner folds paper with three consecutive
folds. Child is asked to copy exactly.

Test 24: Giving Word Opposites. Child is asked to give words meaning
opposite of cold, bad, thick, dry, dark, and sick.


Test materials are quite simple. Copying, imitating, and responding 
to simple verbal relations enter into a number of the tests. 

This test appears to be somewhat more reliable than the
Merrill-Palmer. Correlation between two forms of the test given within a
few days of each other was found to be 0.89. Below 3 years, this
test did not correlate very well with later Binets, but the Minnesota
given between 3 and 4 gave a correlation with Binets at school age
of about 0.60. However, I.Q.'s on the Minnesota Preschool Scale
have quite a different spread from those for the Binet, so a preschool
I.Q. on this test is not readily equated to later Binet performance.
(See reference 14.)


CULTURE-FREE AND CULTURE-FAIR TESTS 


Many workers in the field of aptitude testing have been distressed
by the fact that test performance depends upon the experiences the
person has had. Every test maker has recognized this to a degree
and has tried to base test items upon experiences that would be
common to the group for whom the test was planned. But some
have perhaps taken too narrow a view of the group for whom the
experiences should be common. Certainly the test that incorporates
pictures of the usual American house, automobile, or football is not
suitable for an Australian Bushman who has seen none of these
objects. The typical American test assumes the common core of an
American culture. Some critics have gone further and asserted that
the typical test is based upon an urban middle-class American
culture. Both in its highly verbal content and its emphasis upon speed,
competition, and doing one's best, it is said to be centered in the
middle-class culture and values.


Several attempts have been made to develop tests that are "culture
free," or if not that at least "culture fair." These are closely
related to the non-verbal and performance tests described in the
previous section, because a culture-free test is almost necessarily
non-verbal. It must not only be non-verbal but must also be free of the
content of any particular culture.

One attempt to develop such a test is the Cattell Culture Free
Intelligence Test. The Cattell Test is based on the premise that general
intelligence is a matter of seeing relationships in the things with which
we have to deal, that the ability to see relationships can be tested with
simple diagrammatic or pictorial material, and that for a test to be
usable in different cultures the pictures should be of forms or objects
which are fairly universal, i.e., not peculiar to any cultural group.
Items showing the different types of tasks are shown in Fig. 9.3.


Fig. 9.3. Sample items from Cattell Culture-Free Intelligence Test.
(Panels shown: Part I, Classifications; Part III, Series.) (Copyright
1944, Institute for Personality and Ability Testing, 1608 Coronado
Drive, Champaign, Ill. Reproduced by permission.)




The evidence that the test is in fact useful for widely different cul- 
tures is largely lacking, but the tasks constitute one further interest- 
ing non-verbal group test that may prove usable, particularly in 
research studies. 

A more recent test planned to be culture-free is Rulon's Semantic
Test of Intelligence. This test again uses pictorial and diagrammatic
material entirely. Starting with pictures to represent a cow, a cat,
a man, a woman, and a child, a series of items is used to develop
artificial symbols for those classes of objects and for a number of
action verbs. The test results in building up essentially an artificial
language, based upon experience with the pictures and the symbols
representing them. The test uses pantomime instructions and can
presumably be used with any group that can learn to manipulate a
pencil and mark the test booklet. This test is so new that little
evidence is available as to its effectiveness in measuring intellectual
development in our own culture or its fairness to individuals from
other cultures.

An attempt to develop a test that imposes no penalty on different
classes in American society is found in the Davis-Eells Games.
(Presumably the child is supposed to be naive and not realize that this
is a test.) This test series involves no written language but does
require quite long oral directions. Types of items include:

1. Best ways, in which three pictures are shown in the test booklet,
and the examinee is orally instructed to mark the one that is
the best way to carry a pile of packages, get over a fence, etc.

2. Analogies, in which the analogies are presented in pictures and
are of the type, "Glove is to hand as sock is to: arm, leg, foot."

3. Probabilities, in which a picture is shown and the examinee
must select the one of three orally presented choices that indicates
what probably led up to or is represented in the picture.

4. "Money," a task based on complex directions for following
certain rules for combining coins to make specified sums.

This test was designed to avoid the cultural biases thought to
characterize previously existing tests, particularly socio-economic
biases within the American culture. Evidence that the authors
present on this point is the average difference in item difficulty for high
socio-economic and low socio-economic groups. The difference for
the Davis-Eells Games is, according to the authors, very much smaller
than that for other group tests with which these tests were compared.
However, evidence is presented only upon the single items. If
independent evidence on the test as a whole confirms the lack of
relationship to socio-economic level, an interesting type of instrument
will be available.

The freedom from cultural influences in the Davis-Eells test, in so
far as it has been achieved, has been accomplished at considerable
cost in testing efficiency. An hour and a half to 2 hours of testing
time are required to achieve a reliability (split-half) of about 0.82.
Correlations with school achievement are somewhat lower than for
the usual test. It remains to be seen whether this test will prove
useful in the school setting, or whether it is only an interesting
research tool. The same may be said of the other culture-free tests.


GROUP VERSUS INDIVIDUAL TESTS AS MEASURES
OF INTELLIGENCE


We have seen that intelligence tests fall into two main patterns,
group tests and individual tests. The types of tasks presented to
the examinee are a good deal alike in both patterns. However, the
two procedures have certain significant differences. These may be
summarized as follows:

Group Tests: Problems presented in printed booklet. Read by
examinee. Personal contact with examiner a minimum.
Individual Tests: Problems presented orally by examiner in
face-to-face situation.

Group Tests: Tasks presented and test timed as a unit, or separate
time limits for each subtest.
Individual Tests: Problems presented one at a time, usually without
indication of time limits.

Group Tests: Individual usually responds by selecting one of a
limited set of response options printed in the test booklet.
Individual Tests: Individual usually responds freely, giving whatever
response seems appropriate to him.

These differences in procedure have several important implications
for the conduct of testing and for the results that may be obtained
from such testing. In the first place, when test tasks are presented
orally to the subject and he does not have to read them for himself,
his performance is much less dependent upon his reading skills. The
child who has lagged behind in acquiring these skills is not penalized
for this specific failure. The effect of reading disability upon
intelligence test performance is shown clearly in a study comparing
individual Stanford-Binet scores and group-test scores of retarded,
normal, and accelerated readers in the sixth grade. For those
children whose reading was a year or more accelerated (in relation to
Stanford-Binet mental age), group-test I.Q. averaged 15 points higher
than the individual Stanford-Binet I.Q. Where reading was within
+ or - 1 year of Stanford-Binet mental age, the group test I.Q. was
2 points higher. Where reading was retarded a year or more, group
test I.Q. fell 8 points below the Stanford-Binet I.Q. Thus, the
accelerated reader received a 15 point bonus, the retarded reader an 8
point penalty in I.Q. on the printed group test as compared with
the individual test orally administered.

The results reported above are probably somewhat extreme,
because the particular group test was very verbal in nature and
because the study was carried out with elementary-school children, for
whom the actual operation of reading still represents something of a
task. One may anticipate that less difference would be found for
high-school or college students. Furthermore, some current group
tests are either partly or wholly non-language in their content and
would be relatively independent of reading skills. However, this
study points out very clearly the caution with which a group test
I.Q. must be interpreted for a person who departs markedly from
the average in his reading skills. A low group test I.Q. for a poor
reader cannot be taken at face value. It should always be checked
with a test that does not involve reading.

The presentation of problems one at a time by an examiner is
also a factor of some significance in determining what the test is
likely to yield. Especially with younger children, maintaining
continuity of attention and effort on a group test may be a problem, and
variations in this respect are certainly a significant factor in test
score. When each problem is separately presented by the examiner,
this serves to re-establish the child's orientation to the task and to
maintain his effort. What is equally important, the examiner is in a
position to observe lapses of interest and effort and to take some
account of them in interpreting the results.

The individual intelligence test is essentially a well-standardized
interview situation. The tasks to be presented to the examinee are
specifically formulated, and detailed standards are provided for
evaluating his responses. However, at the same time, the
face-to-face relationship of an interview prevails. This offers the alert
examiner a wealth of opportunities for observing the examinee and
noting poor motivation, distractibility, signs of anxiety and upset,
and other cues that will help in interpreting the actual test
performance. At the same time, the demands upon the examiner are
considerably heavier. If valid testing is to result, the tasks must be
presented in a standard way, interest and cooperative effort must
be maintained, and a uniform standard must be applied in
evaluating responses.

The free-response item in the individual test fits into the inter- 
view setting of the individual test and reinforces both its strengths 
and its limitations. Potentially, the free response of the examinee 
can tell us more about him than the mere record of which option he 
has chosen from a set of five. There is more of the quality of his 
own behavior available to us. We can see just how he goes about 
defining a word, whether by class and differentia (i.e., an orange is a
round, orange, citrus fruit) or by use (an orange is to eat). We can
note the speed and sureness of his attack on a problem task. But 
we must also depend on the examiner to interpret and evaluate the 
responses, and at this point subjectivity is likely to creep into the 
examining. Careful attention must be paid to the standard samples 
provided in the test manual, and experience under supervision is in- 
dicated before an examiner can expect to give and score an individual 
intelligence test in a way that will yield results comparable to those 
of other examiners. 


In general, the limitations of group tests are most acute and the 
advantages of individual tests most pronounced with young children. 
Printed group tests cannot be used successfully with children below 
school age. They cannot read and have difficulty in manipulating a 
pencil, following instructions, or maintaining sustained attention for 
the period that is required for taking a test. These same factors 
continue to present fairly serious problems for testing in the primary 
grades. However, the factor of cost makes individual testing im- 
practical for most large-scale users of tests, so that with older in- 
dividuals the overwhelming majority of the intelligence tests used 
are paper-and-pencil group tests. 


RELIABILITY AND STABILITY OF MEASURES 
OF INTELLIGENCE 


We have already presented some evidence on the reliability of
measures of intelligence in our discussion of infant and preschool
tests. The reliability of those early measures is found to be quite
modest. For tests at school age, reliabilities are more promising.
Considering the group tests first, we find that when correlations
between two forms of the same test are reported for an age group or a
grade group they usually fall between 0.80 and 0.90. A few are
higher. Unfortunately, the authors of some tests report only
odd-even reliabilities, and it is difficult to estimate how much these are
inflated. (See discussion on pp. 129-130.) Comparisons of different
tests are made difficult by variations in the procedure used for
estimating reliability and in the type of group for which results are
reported.
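
As an illustration of the odd-even procedure mentioned above, the
sketch below correlates each examinee's score on the odd-numbered items
with his score on the even-numbered items and then steps the half-test
correlation up to full length with the Spearman-Brown formula,
r_full = 2 r_half / (1 + r_half). The function names and the small
matrix of item scores are invented for the example; they are not data
from any test discussed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, factor=2.0):
    """Estimated reliability of a test `factor` times as long as the part."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per examinee."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Made-up scores of five examinees on a six-item test (1 = right, 0 = wrong).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
print(round(odd_even_reliability(scores), 2))
```

Because both halves are taken at one sitting, a coefficient obtained
this way leaves out day-to-day variation, and in a speeded test the two
halves are not independent. Either may be part of the inflation the text
mentions.
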


The correlations reported by the authors between Form L and
Form M of the Revised Stanford-Binet ranged from 0.85 to 0.95 for
different age groups. For ages from 2 to 6, the median value was
0.88, whereas for ages above 6 the median was 0.93. The Wechsler
Intelligence Scale for Children has a reported reliability (split-half
corrected) for the full scale of 0.92 at age 7½, 0.95 at age 10½, and
0.94 at age 13½. These correlations are based upon the sample of
200 children at each age level that was used in developing norms for
the test. Data on the reliability of the Wechsler-Bellevue for adults
are surprisingly meager in view of the extensive use that has been
made of the test. In one test-retest study with a normal group
the standard error of measurement was reported to be 3.29. This
would correspond to a reliability coefficient of 0.95 in a group with
the standard deviation of 15 points, which represents the variability
of the test in the norming population. The author reports a
standard error of measurement of 5.67, and this would correspond to a
reliability coefficient of 0.86. Perhaps 0.90 is not far from the
correct value.
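
The two conversions just given follow from the usual relation between
the standard error of measurement and the reliability coefficient,
SEM = SD * sqrt(1 - r), so that r = 1 - (SEM / SD)^2. The short check
below assumes that relation and the standard deviation of 15 points
cited above; the function names are illustrative only.

```python
from math import sqrt


def reliability_from_sem(sem, sd):
    """Reliability implied by a standard error of measurement: 1 - (SEM/SD)^2."""
    return 1.0 - (sem / sd) ** 2


def sem_from_reliability(r, sd):
    """Standard error of measurement implied by a reliability: SD * sqrt(1 - r)."""
    return sd * sqrt(1.0 - r)


SD = 15.0  # standard deviation in the norming population, as cited in the text

for sem in (3.29, 5.67):
    print(f"SEM {sem:.2f} -> reliability {reliability_from_sem(sem, SD):.2f}")
# SEM 3.29 -> reliability 0.95
# SEM 5.67 -> reliability 0.86

print(f"reliability 0.90 -> SEM {sem_from_reliability(0.90, SD):.1f}")
# reliability 0.90 -> SEM 4.7
```

Read the other way, the compromise value of 0.90 would imply a standard
error of measurement of a little under 5 I.Q. points in such a group.
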

Though the variations in procedure for estimating reliability and
in type of group tested make it difficult to arrive at an unequivocal
answer, it does seem that the individual intelligence tests yield a
somewhat more reliable measure than do the commonly used group
tests. This is probably in part a reflection of the somewhat longer
actual testing time, in part a result of more uniform motivation and
effort when working under the eye of the examiner.

The reliabilities of intelligence tests are reasonably satisfactory,
and they are among the most dependable psychological measuring
instruments. However, the chance errors in an I.Q. are still enough
to require that we be quite tentative in our interpretation. Thus,
Table 9.2 shows the spread of I.Q.'s that could be expected on Form
M of the Binet if that form were given to a group of pupils all of
whom had received exactly the same I.Q. on Form L. Note that
the I.Q.'s spread over a range of more than 25 points, and that less
than a third of the cases fall in the center 5-point interval. And
it must be remembered that these figures are for the Stanford-Binet,
one of our most reliable tests. Thus, an I.Q. of 100 must not be
thought of as meaning "exactly 100," but rather "probably between
95 and 105, very probably between 90 and 110, almost certainly
between 85 and 115."


Table 9.2. Distribution of Stanford-Binet Form M I.Q.'s for Cases with
Identical Form L I.Q.'s

I.Q.              f

113+              3
108-112           9
103-107          23
98-102           30
93-97            23
88-92             9
87 and below      3
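
The entries of Table 9.2 can be tallied directly to see how closely the
Form M scores cluster. The sketch below assumes that the f column gives
cases per hundred (the seven entries sum to 100) and that the common
Form L I.Q. lies at the middle of the 98-102 interval; the variable and
function names are illustrative.

```python
# Table 9.2: Form M I.Q. interval -> frequency f (the entries sum to 100).
table_9_2 = [
    ("113+", 3),
    ("108-112", 9),
    ("103-107", 23),
    ("98-102", 30),
    ("93-97", 23),
    ("88-92", 9),
    ("87 and below", 3),
]

total = sum(f for _, f in table_9_2)


def share(labels):
    """Proportion of all cases falling in the named intervals."""
    return sum(f for label, f in table_9_2 if label in labels) / total


center = share({"98-102"})
within_7 = share({"93-97", "98-102", "103-107"})
within_12 = share({"88-92", "93-97", "98-102", "103-107", "108-112"})

print(f"center interval (98-102): {center:.0%}")   # 30%
print(f"within 93-107:            {within_7:.0%}")  # 76%
print(f"within 88-112:            {within_12:.0%}")  # 94%
```

On these figures about three cases in ten stay within 2 points of the
original score, about three in four within 7 points, and about 94 in
100 within 12 points.
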


STABILITY OVER A PERIOD OF YEARS 


In addition to knowing the precision with which an intelligence
test appraises an individual's abilities at a particular time, we would
like to know how consistently the individual maintains his position
in his group from one year to the next or over a considerable span of
years. How confidently can we predict what scholastic aptitude an
individual will show when he is of college age from his performance
on a test at age 2? Age 6? Age 10? Evidence on this point is
presented in Figs. 9.4 and 9.5.


Fig. 9.4. Effect of age at initial testing and test-retest interval on
prediction of later Stanford-Binet I.Q. from earlier test. (Adapted from
Honzik, McFarlane, and Allen.) [Horizontal axis: age at later test.
Vertical scale: 0.00 to 1.00. Separate curves for initial tests at 30,
36 mo.; 4, 5 yr.; 6, 7 yr.; 8, 9 yr.; and 10 yr.]


Fig. 9.5. Effect of test-retest interval on prediction of group test
intelligence at end of high school from earlier group tests. (Study A
adapted from J. E. Anderson. Study B adapted from R. L. Thorndike.)
[Horizontal axis: years between tests. Vertical scale: 0.00 to 1.00.]


Figure 9.4 shows the findings from one extensive study using
individual tests. The final test is the Stanford-Binet in every case.
The initial test is the California Pre-school Scale up through 5 years
and is the Stanford-Binet after that age. Note that for the early
tests the prediction is rather poor and drops as the interval is
increased. A test at age 2 correlates only 0.37 with one at age
[illegible] and 0.21 with one at age 14 or 15. As we go up the age
range, however, the correlations are higher and the drop is less. A
test given at age 8 or 9 correlates 0.88 with one at age 10 and still
[illegible] with one at age 14 or 15. For normal children in a usual
environment, a Stanford-Binet at age 8 or 9 appears to provide almost
as accurate a forecast of ability near the end of high school as would
the same test given several years later.




Two sets of data on stability of group-test performance over time 
are presented in Fig. 9.5. The two follow the same general pattern, 
though they differ a good deal in detail. As we go back further in 
time, the correlation coefficients tend to drop more or less steadily. 
The earlier tests at around grade 3 or 4 correlate perhaps 0.50 to 
0.60 with the final test, but for a test in grade 9 or 10 the correlation 
is 0.70 to 0.80. In these studies of group tests, the tests that were 
used differed at the different ages. For this reason, it is not clear 
how much the lower correlation over the longer intervals is due to 
growth changes in the subjects over a span of years and how much 
it is due to changes in the material included in the tests. From the 
practical point of view, Fig. 9.5 suggests that a group intelligence 
test needs to be supplemented by new testing every 3 or 4 years if 
pupil records are to provide an accurate indication of current ability 
level. 


THE PRACTICAL IMPORTANCE OF INDIVIDUAL 
DIFFERENCES IN MEASURES OF INTELLIGENCE 


To what extent are the individual differences that are brought out 
by tests of intelligence of importance in the practical affairs of life? 
Do they enable us to predict to a useful degree how an individual 
will perform in school, on a job, or in other life adjustments? 


INTELLIGENCE AND SCHOOL SUCCESS 


First, let us consider academic success. From the many hundreds 
of investigations of intelligence test scores in relation to academic 
success, a number of conclusions can safely be drawn. These may 
be summarized as follows: 

1. The Correlation of Intelligence Test Score with School Marks Is
Substantial. Viewing all the hundreds of correlation coefficients that
have been reported, a figure of 0.50 to 0.60 might be taken as fairly
representative. Though this constitutes a very definite relationship,
it is only necessary to turn back to Fig. 5.7 and the discussion of
correlation on p. 103 to realize that there are still many marked
discrepancies between intelligence test score and what a
youngster does in school.

2. Higher Correlations Have Been Found in Elementary Schools
Than in High Schools and in High Schools Than in Colleges. Past
studies have indicated a drop in correlation from perhaps 0.70 in
elementary school to 0.60 in high school and 0.50 in college. This
situation in the past has probably been due to the progressive
elimination of individuals with lower intellectual ability as one went up
the educational ladder. Less accurate discriminations were possible
in the selected group remaining. The historic trend may be somewhat
changed with the increasing per cent of children continuing in
school and with less emphasis on marks in the elementary grades.

3. Previous School Achievement Has Given Correlations with Later
School Success as High as or Higher Than Intelligence Test Score. In
predicting college marks, for example, high-school record has shown
correlations at least as high as those resulting from a scholastic
aptitude test at entrance.

4. Intelligence Test and Achievement Combined Give Still Better
Prediction. By pooling information on previous school achievement
with intelligence test score, the correlation with later school
achievement can be raised above that yielded by either factor alone. The
two types of information supplement one another.

5. Intelligence Tests Correlate Higher with Standardized Measures
of Achievement Than with School Marks. Correlations between an
intelligence test and total score on an achievement battery in the
0.70's or even 0.80's are not unusual. Thus, for one large
eleventh-grade group the correlation between the California Test of Mental
Maturity and total achievement on the Progressive Achievement Test
was found to be 0.71, whereas for a group from grades 4, 5, and 6
it was 0.84. Another report⁷ gives correlations of 0.84 and 0.78 for
grades 5 and 7 for the correlation between the Pintner General Ability
Test and the Metropolitan Achievement Test.

6. The Degree to Which Intelligence Tests Are Related to Academic
Success Depends Upon the Subject Matter. As one would expect, the
more academic subjects, which depend more completely upon the
same kinds of verbal and numerical symbols as those that bulk so
large in intelligence tests, show the higher correlations. Thus, one
summary of studies in secondary and higher education reports an
average correlation of 0.46 with natural science grades and 0.38 with
English grades and foreign language grades but only 0.28 with shop
work and 0.22 with grades in domestic science.
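
Several of the conclusions above rest on standard relations among correlation coefficients, and a small numerical sketch may make them more concrete. The function names below are hypothetical, and the example inputs (a selected group with 0.7 of the original spread, a high-school record correlating 0.55 with college marks, an intercorrelation of 0.50 between the two predictors) are assumptions chosen for illustration, not figures from the studies cited. The restriction formula assumes direct selection on the test, a linear relation, and equal scatter about the regression line.

```python
import math


def standard_error_of_estimate(r: float) -> float:
    """Scatter of actual marks around predicted marks, in units of the
    standard deviation of marks: sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


def restricted_correlation(r: float, sd_ratio: float) -> float:
    """Correlation to be expected after direct selection on the test.

    sd_ratio = (SD of test scores in the selected group) /
               (SD of test scores in the unselected group).
    """
    k = sd_ratio
    return r * k / math.sqrt(1.0 - r * r + r * r * k * k)


def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors whose
    separate correlations are r1 and r2 and whose intercorrelation is r12."""
    r_squared = (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 * r12)
    return math.sqrt(r_squared)


if __name__ == "__main__":
    # Conclusion 1: a correlation of 0.50 to 0.60 leaves wide discrepancies.
    for r in (0.50, 0.60):
        print(f"r = {r:.2f}: error of estimate = "
              f"{standard_error_of_estimate(r):.2f} SD")

    # Conclusion 2: eliminating low scorers shrinks the spread and the r.
    print(f"0.70 after selection (assumed SD ratio 0.7): "
          f"{restricted_correlation(0.70, 0.7):.2f}")

    # Conclusion 4: pooling test score and previous record (assumed values).
    print(f"combined prediction: R = {multiple_r(0.50, 0.55, 0.50):.2f}")
```

Under these assumed inputs, a correlation of 0.50 leaves an error of estimate of about 0.87 of the standard deviation of marks, a correlation of 0.70 falls to about 0.57 in the selected group, and the two predictors together give a multiple correlation of about 0.61, somewhat above either one alone.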

The fact that intelligence tests correlate with academic achievement
and school progress is unquestioned. From the very way in
which the tests were assembled it could hardly be otherwise. How
these facts should be capitalized upon in educational planning and
individual guidance is a more troublesome matter. We will return
to it later in the chapter.




INTELLIGENCE IN RELATION TO OCCUPATIONAL LEVEL 


We turn our attention now to out-of-school accomplishments and 
consider how intelligence test scores relate to achievement in the 
world of work. There are two types of questions that we may raise: 
(1) How do workers in different kinds of jobs compare in measured 
intelligence? (2) Within a given kind of job, to what extent is in- 
telligence related to job success? 

In relation to the first question, we have a good deal of evidence
stemming from the testing of recruits carried out during World
Wars I and II. Data for a selection of representative jobs are shown
in Table 9.3. This table shows the 10th, 25th, 50th, 75th, and 90th
percentiles on Army General Classification Test standard score (based
on standardization with an average value of 100 and a standard
deviation of 20). A marked gradient is noticed from such occupations
as accountant, teacher, and lawyer to such occupations as barber,
miner, and lumberjack. The gradient follows fairly closely the
educational requirements or average educational background for
each occupation. In general, one may say that occupations select
out individuals jointly on the basis of educational level and of
intelligence. Whether intelligence enters as a significant factor
excepting as it determines educational level is more difficult to determine.
In any event, the net result is an appreciable difference between different
occupational groups in performance on intelligence tests.


Table 9.3. AGCT Standard Scores of Occupational Groups in World War II

                                    Percentile
Occupational Group         10     25     50     75     90
Accountant                114    121    129    136    143
Teacher                   110    117    124    132    140
Lawyer                    112    118    124    132    141
Bookkeeper, general       108    114    122    129    138
Chief clerk               107    114    122    131    141
Draftsman                  99    109    120    127    137
Postal clerk              100    109    119    126    136
Clerk, general             97    108    117    125    144
Radio repairman            97    108    117    125    136
Salesman                   94    107    115    125    133
Store manager              91    104    115    124    133
Tool maker                 92    101    112    123    129
Stock clerk                85     99    110    120    127
Machinist                  86     99    110    120    127
Policeman                  86     96    109    118    128
Electrician                83     96    109    118    124
Meat cutter                80     94    108    117    126
Sheet metal worker         82     95    107    117    126
Machine operator           77     89    103    114    123
Automobile mechanic        75     89    102    114    122
Carpenter, general         73     86    101    113    123
Baker                      69     83     99    113    123
Truck driver, heavy        71     83     98    111    120
Cook                       67     79     96    111    120
Laborer                    65     76     93    108    119
Barber                     66     79     93    109    120
Miner                      67     75     87    103     11
Farm worker                61     70     86    103    115
Lumberjack                 60     70     85    100    116

Adapted from N. Stewart.
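
The AGCT standard-score scale described above (average 100, standard deviation 20) can be turned into standard-deviation units, and, if one is willing to assume a normal distribution of scores in the general recruit population, into approximate percentile ranks. The following is only a sketch under that normality assumption; the function names are hypothetical, and the medians are copied from Table 9.3.

```python
from statistics import NormalDist

AGCT_MEAN = 100.0  # average of the standardization group, as stated in the text
AGCT_SD = 20.0     # standard deviation, as stated in the text


def agct_to_z(score: float) -> float:
    """Distance of an AGCT standard score from the average, in SD units."""
    return (score - AGCT_MEAN) / AGCT_SD


def agct_to_percentile(score: float) -> float:
    """Approximate per cent of the general population falling below the
    score, ASSUMING scores are normally distributed."""
    return 100.0 * NormalDist(AGCT_MEAN, AGCT_SD).cdf(score)


# Median (50th percentile) scores of four groups from Table 9.3.
MEDIANS = {"Accountant": 129, "Lawyer": 124, "Machinist": 110, "Lumberjack": 85}

if __name__ == "__main__":
    for job, median in MEDIANS.items():
        print(f"{job:<12} median {median}: z = {agct_to_z(median):+.2f}, "
              f"about {agct_to_percentile(median):.0f}th percentile")
```

On these assumptions the median accountant stands about 1.45 standard deviations above the general average and the median lumberjack about 0.75 standard deviations below it.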

While noticing the differences between groups, one must not forget
the substantial range of score within each group. Individuals
differing widely in abstract intelligence function together in the same
occupation. Thus, the upper 10 per cent of meat cutters did as well
on the A.G.C.T. as the average lawyer. The bottom 10 per cent of
lawyers showed no more intellectual ability than the upper 10 per
cent of miners. In spite of group differences in average score, there
are still wide individual differences within groups.
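
The overlap between occupational groups can be checked directly against the percentiles in Table 9.3. The sketch below copies a few rows of the table and asks whether the 90th percentile of one group reaches the median of another; the function name and the choice of rows are illustrative assumptions.

```python
# (10th, 25th, 50th, 75th, 90th) percentiles copied from Table 9.3.
TABLE_9_3 = {
    "Accountant": (114, 121, 129, 136, 143),
    "Lawyer": (112, 118, 124, 132, 141),
    "Machinist": (86, 99, 110, 120, 127),
    "Meat cutter": (80, 94, 108, 117, 126),
    "Truck driver, heavy": (71, 83, 98, 111, 120),
}

# Column positions within each row of the table.
P10, P25, P50, P75, P90 = range(5)


def top_tenth_reaches_median(lower_group: str, higher_group: str) -> bool:
    """True if the 90th percentile of lower_group is at least as high as
    the median (50th percentile) of higher_group."""
    return TABLE_9_3[lower_group][P90] >= TABLE_9_3[higher_group][P50]


if __name__ == "__main__":
    for job in ("Machinist", "Meat cutter", "Truck driver, heavy"):
        verdict = "reaches" if top_tenth_reaches_median(job, "Lawyer") else "falls short of"
        print(f"Top tenth of {job} {verdict} the median lawyer "
              f"({TABLE_9_3[job][P90]} vs. {TABLE_9_3['Lawyer'][P50]})")
```

For the rows copied here, the top tenth of meat cutters (126) and of machinists (127) reach the lawyers' median of 124, while the top tenth of heavy truck drivers (120) falls a little short of it.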


INTELLIGENCE AND JOB SUCCESS 

What can we say about the relationship of intelligence test score 
to success within particular jobs? A summary of the findings re- 
ported in a number of different studies is presented in Table 9.4. 


Table 9.4. Relationship of Intelligence Test Score to Measures of Job Success

                        Median          Per Cent
                        Correlation     Significantly
                        with            Positive          Number of
Type of Job             Job Success     Correlations *    Coefficients
Clerical workers        [...]            70               [...]
Supervisors              .40             78               [...]
Salesmen                 .33            100               [...]
Sales clerks            -.09            [...]             [...]
Protective services      .25            [...]             [...]
Skilled workers         [...]           [...]             [...]
Semiskilled workers      .20             47               [...]
Unskilled workers        .08             31               [...]

Adapted from E. E. Ghiselli and C. W. Brown.
* Significant at 5 per cent level.




With the exception of sales clerks, the median correlation is positive 
in each case. But for unskilled and semiskilled workers the correlations
are quite small. They are higher for clerical workers, supervisors,
and skilled workers, though only in the case of the skilled
workers are they as high as the typical correlations with school suc- 
cess. In part this may be due to limitations in the criterion of job 
success. Whether success is measured by supervisors’ ratings, as is 
usually the case, or by some index of production on the job, the 
indicator is likely to be unreliable and biased by a number of con- 
siderations that have nothing to do with the real efficiency of the 
worker. In so far as this is true, no test given to the individual can 
be expected to predict the criterion. 
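
One conventional way to gauge how far an unreliable criterion can depress a correlation is the correction for attenuation, applied here to the criterion only. The text itself does not apply this correction, and the criterion reliabilities used below are assumed values for illustration; the function name is likewise hypothetical. The correction addresses unreliability only and does nothing about bias in ratings.

```python
import math


def correct_for_criterion_unreliability(r_observed: float,
                                        criterion_reliability: float) -> float:
    """Estimate the correlation a test would show with a perfectly
    reliable criterion: r_observed / sqrt(criterion reliability)."""
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return r_observed / math.sqrt(criterion_reliability)


if __name__ == "__main__":
    # Median correlations from Table 9.4; the reliabilities are ASSUMED.
    for job, r_obs in (("Semiskilled workers", 0.20), ("Supervisors", 0.40)):
        for reliability in (0.50, 0.80):
            corrected = correct_for_criterion_unreliability(r_obs, reliability)
            print(f"{job}: observed {r_obs:.2f}, assumed criterion "
                  f"reliability {reliability:.2f} -> {corrected:.2f}")
```

Even with a criterion reliability as low as the assumed 0.50, an observed correlation of 0.20 would rise only to about 0.28, so an unreliable criterion accounts for part, but probably not all, of the gap between job and school correlations.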

All in all, we may conclude that (1) intelligence is related to 
occupational group membership and (2) though the relationship of 
intelligence test score to job success is usually positive, it is likely 
to be quite low. Prediction of out-of-school achievement appears a 
good deal less accurate than prediction of school achievement. 


INTERPRETATION OF GROUP DIFFERENCES IN 
MEASURED INTELLIGENCE 


As soon as the first intelligence tests were developed, investigators
started administering them to different kinds of groups and studying
group differences in performance on the tests. They compared the
sexes, different age groups, groups of different racial or national
origin, urban and rural groups, groups from different parts of the
country, groups from different socio-economic levels, and so forth.
The findings from these studies were fairly consistent in showing
appreciable group differences. Lower score on intelligence tests was
associated with lower socio-economic status, living in a rural area,
living in the Southern or Southwestern United States, being an [...]
these findings has been a [...] of a good
deal of confusion and conflict.


The first naive tendency was to interpret group differences in
intelligence test performance as indications of innate hereditary
differences between the groups in question. For example, the lower test
performance of the children of laboring class parents was interpreted
as indicating basic genetic differences between that group and the
white-collar group. Now, such basic genetic differences have not
been disproved, but many lines of evidence have made psychologists
much more cautious in interpreting group differences in intelligence
test performance. Many studies have pointed out the role of life
experience in influencing test scores and have made us realize how
dangerous it is to make any comparison of groups whose experiences
differ radically. We shall consider some of the relevant evidence.

The testing in the United States in World War I and in World
War II has made possible a comparison of the level of performance
of the military recruit population in 1918 and in 1940 to 1945. Using
a somewhat revised edition of the 1918 Army Alpha Test with a
sample of World War II recruits, it was possible to estimate the
Army General Classification Test equivalents of different scores on
Army Alpha and thus to compare the performance of the two recruit
populations. It was found that the average World War II recruit
surpassed 83 per cent of the World War I group.

A similar comparative study, on a smaller scale, was made of
children in certain mountain counties of eastern Tennessee. When
1940 performance was compared with that in 1930, it was found that
the average I.Q. for children in these counties had risen from 82.4 to
92.2, a gain of 9.8 points. This gain paralleled a very considerable
increase in accessibility and cultural opportunities in the counties in
question.
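
The statement that the average World War II recruit surpassed 83 per cent of the World War I group can be restated as a shift in standard-deviation units, provided one assumes that scores in the earlier group were normally distributed. The sketch below makes that conversion and also checks the Tennessee gain; the function name is hypothetical and the normality assumption is not made in the text.

```python
from statistics import NormalDist


def shift_in_sd_units(percent_surpassed: float) -> float:
    """Distance of the later group's average above the earlier group's
    average, in SD units of the earlier group, ASSUMING the earlier
    group's scores are normally distributed."""
    return NormalDist().inv_cdf(percent_surpassed / 100.0)


if __name__ == "__main__":
    # Average World War II recruit surpassed 83 per cent of the 1918 group.
    print(f"Recruit shift: about {shift_in_sd_units(83):.2f} SD")

    # Eastern Tennessee counties: average I.Q. in 1930 and in 1940.
    gain = round(92.2 - 82.4, 1)
    print(f"Tennessee gain: {gain} I.Q. points")
```

Under the normality assumption, surpassing 83 per cent of the earlier group corresponds to a rise of roughly 0.95 of a standard deviation.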
Comparisons of national groups in their own countries have failed
to substantiate differences found between immigrant groups in the
U. S.¹⁰ Studies of Negro children in New York City have shown a
tendency for the I.Q.'s to be higher for those children who had spent
a longer time in New York. Studies of foster children have found
a level of intelligence for these youngsters above what would have
been predicted from the intelligence or social level of their biological
parents.

All these findings point to the fact that intelligence test score
depends upon experience. Where groups differ widely in experience,
differences in test score may be expected to result. Thus, in the
United States between 1918 and 1940 the median schooling of
18-year-olds increased from about 8½ years to about 10½ years. In
addition, radio sets appeared in over 80 per cent of the homes of
the country. Good roads pushed out into the rural areas, so that it
was relatively easy to get to town. These are only some of the
social and cultural changes. These changes had their impact upon
test performance. A more educated population, exposed to more
experiences and perhaps especially to more extensive and varied use
of language, did better on the tests.




The present discussion does not negate the significance of intelli- 
gence test differences in individuals. These differences are large even 
for individuals who have had closely similar environmental oppor- 
tunities. Environment and experience are not the whole story or 
perhaps even a major part of the story. However, the discussion 
should make us slow to accept group differences uncritically on their 
face value. It should also make us realize that in interpreting the 
performance of an individual, some allowance must be made for the 
environmental opportunity he has had. An I.Q. of 90 has a rather
different meaning for a Negro child who spent his early years in a
share-cropper's cabin in the rural South from what it has for the son 
of the local banker. 


USING INTELLIGENCE TEST RESULTS IN SCHOOLS 


There are, in general, three types of settings in which standardized
tests are used in schools, and intelligence tests should be considered
in relation to each of these. Standardized tests may enter into
administrative policy as a basis for administrative decisions on such
matters as class grouping, promotion, eligibility for certain classes
and curricula, and the like. Standardized tests may be used by the
classroom teacher as aids to understanding the individual pupils
with whom he must deal and in making adaptations and adjustments
to their individual needs. Tests may be used by the guidance
staff of the school in planning the most effective use of special
resources for diagnostic and remedial teaching, in helping the pupil
and his family arrive at sound and realistic educational and
vocational plans, and in helping understand personal adjustment crises
when they arise. We may consider intelligence tests in each of these
contexts.


INTELLIGENCE TESTS AND THE SCHOOL ADMINISTRATION 


Intelligence tests are likely to enter into the actions of the school
administration either (1) through a policy of using test results as one
basis for forming the group for a classroom or (2) through regulations
specifying score levels that permit or require some special action,
e.g., assignment to a slow-learning class, eligibility to take algebra,
eligibility for a special school, etc. What is an appropriate
attitude toward administrative actions of these sorts?

Grouping by Intellectual Ability. The policy of forming class groups
at least in part on the basis of the intellectual level of the pupils
remains a common one. In 1947 to 1948 more than half of city school
systems reporting used ability grouping in some form in one or
more schools. However, the procedure remains a controversial one.
In part this is due to the varied and somewhat contradictory results
obtained in studies of the effects of ability grouping. In part it is
due to the variety of specific practices subsumed under the same
label of "ability grouping" or "homogeneous grouping." In part it
is based upon the different initial biases of those discussing the
problem.

It is probably impossible to make any evaluation of ability grouping
in general. Grouping together pupils of like mental age is only a
first step in adapting the class program and procedures to the
abilities of the pupils in the class. What is most important is what
adaptations are made in materials and procedures after the grouping
has been carried out and also what attitudes exist or can be developed
in the community toward the grouping and the adjustments that
accompany it. The inconclusive findings in some studies of ability
grouping suggest that in some cases the adaptations of materials and
procedures have not been very great and that benefits do not
automatically accrue from grouping by level of mental ability. At best,
grouping is a procedure which facilitates further teacher adaptations
to the pupil and the group. At the worst, it may become a substitute
for study of and interest in the pupil as an individual.

Many of both the gains and hazards of ability grouping have been
claimed to lie in relatively intangible areas of interest, attitude, and
adjustment. Evaluations in these areas have generally been quite
inadequate. Thus, it is still largely a matter of opinion whether the
bright child develops better work habits and leadership traits or
feelings of snobbishness and superiority from being in a special class
group.

Ability grouping for the bulk of pupils is one issue, and special
classes for the relatively extreme deviate is a somewhat different one.
How about the highest and lowest 2 or 3 or 5 per cent in
intelligence? Here we must recognize that special administrative
provisions are possible only in a community of some size. Unless there
are perhaps 500 children per grade in the school system, there will
not be enough extreme deviates to fill a class group. The problem
of the extreme deviate becomes most acute in the case of the low
deviate, because of the obvious problems that the slow learners have
in adapting to the activities and tempo of a regular class.
Special class groups have not been a universal panacea, but they
permit adaptation of the type of class activities and the rate of
progress to the interests and abilities of the slower learners.
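
The arithmetic behind the figure of 500 children per grade is easily laid out. The short sketch below simply multiplies grade size by the per cents mentioned in the text; the function name and the smaller grade sizes shown for comparison are illustrative assumptions.

```python
def expected_extreme_pupils(children_per_grade: int, percent: float) -> float:
    """Expected number of pupils per grade falling in the highest (or the
    lowest) given per cent of the intelligence distribution."""
    return children_per_grade * percent / 100.0


if __name__ == "__main__":
    for grade_size in (100, 250, 500):   # 100 and 250 are assumed comparisons
        counts = ", ".join(
            f"{pct:g}%: {expected_extreme_pupils(grade_size, pct):g}"
            for pct in (2, 3, 5)
        )
        print(f"{grade_size} children per grade -> {counts}")
```

With 500 children per grade, the extreme 2, 3, or 5 per cent amount to about 10, 15, or 25 pupils at one end of the range. Whether that fills a class depends on local standards of class size, which the text does not specify.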




The very bright child is usually a less conspicuous problem in the
regular class. He gets the regular work done. His boredom is less
apparent. Furthermore, the alert teacher can often provide
supplementary activities which will keep him profitably occupied.
However, there is evidence that children of high ability who are placed
in special groups can master the regular school curriculum more
rapidly than they would in regular classes, or engage in a wide range
of enrichment activities without falling behind children in regular
schools. Furthermore, there is no real evidence that membership in
special class groups results in undesirable personality attributes in
these children. In view of the importance of individuals of high
ability for our society and in view of the long period of training that
most of them must undergo to take a role in the professional groups
of our society, special provisions to accelerate or enrich their early
training would seem to be a sound social provision where such
provisions are administratively feasible.

Intelligence Test Score as an Administrative Prerequisite.
Intelligence test results may enter into administrative actions if a certain
level of intelligence is specified as a prerequisite for some action in
relation to a pupil. Generally speaking, the relationship of
intelligence test score to educational progress or success is low enough and
the variety of factors involved is great enough so that rigid
administrative standards on intelligence seem rather questionable.
Intelligence is often a factor that should receive consideration, together
with other factors, in arriving at a decision with respect to any
individual. But room for flexibility of action is needed, in the light of
all relevant factors. Administrative actions should provide the
framework of policy within which intelligence tests are considered rather
than set rigid specifications for particular educational provisions.


INTELLIGENCE TESTS AND THE CLASSROOM TEACHER 

The classroom teacher will want to use intelligence test results as
an aid to understanding each pupil in the class and to providing the
school experiences that will be most helpful to that pupil. The
child's level as measured by an intelligence test provides probably
the best single clue available to the teacher as to the child's
potentialities for learning the abstract symbolic aspects of the school
curriculum. The test results provide a guide as to what can reasonably
be expected of each pupil: whether the pupil should be expected to
move along as rapidly as the rest of the class, whether the pupil's
achievement is falling enough behind expectation to suggest the need
for special diagnostic or remedial procedures, or whether the pupil's
abilities are enough ahead of those of the bulk of the class so that
the teacher should try to provide special activities and opportunities
for enriching the regular program.

There are certain cautions that need to be observed when the 
classroom teacher makes use of intelligence test scores for his pupils. 
An enumeration of the pitfalls may help the reader to avoid them.


1. The general intelligence test, especially the group test, is a
measure of ability to work with symbols, abstract ideas, and their
relationships. This is one quite limited type of ability. The test
does not encompass ability to work with things or people, or perhaps
the ability to solve many types of concrete and practical problems.
The child who is low on an intelligence test will probably have trouble
with the academic aspects of the conventional school curriculum.
However, he may have a good level of skill or ability in the many
non-abstract aspects of living: mechanical, social, artistic, musical.
The teacher should seek these strengths, capitalize upon them, and
build upon them. Above all, the teacher must recognize that
intelligence test score is not a measure of personal worth and must avoid
rejecting the child whose aptitude for academic pursuits is low.

2. The verbal group intelligence test that is ordinarily used for
school-wide testing is sufficiently dependent upon reading and
arithmetical skills that a low test score must be interpreted cautiously for
a poor reader or low achiever in arithmetical skills. If possible,
individuals of this sort should be tested also with an individual test or
a non-verbal group test to determine whether the low ability is
basically correct or whether it is a reflection of limited reading and
number skills.

3. Intelligence test results for a child whose social and cultural
background differs radically from that of the rest of the group should
be interpreted with caution. The possibility of some degree of
environmental deprivation should be borne in mind.

4. If it is known or suspected that a child was emotionally
disturbed at the time of testing, results should be considered quite
tentative. Motivation and effort are needed for sound test results.

5. The standard error of measurement should always be very real
to the test interpreter. An I.Q. of 90 should always signify to the
teacher an I.Q. somewhere between 80 and 100.
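
The band of 80 to 100 around an I.Q. of 90 in the last caution can be reproduced from the usual formula for the standard error of measurement. The sketch below is illustrative only: the standard deviation of 16 and the reliability of 0.90 are assumed values that happen to give such a band, not figures quoted in this passage, and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sd: float, reliability: float,
               n_sem: float = 2.0) -> tuple[float, float]:
    """Band of n_sem standard errors on either side of an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sem * sem, observed + n_sem * sem


if __name__ == "__main__":
    ASSUMED_SD = 16.0           # assumption, not from the text
    ASSUMED_RELIABILITY = 0.90  # assumption, not from the text

    sem = standard_error_of_measurement(ASSUMED_SD, ASSUMED_RELIABILITY)
    low, high = score_band(90, ASSUMED_SD, ASSUMED_RELIABILITY)
    print(f"SEM is about {sem:.1f} I.Q. points")
    print(f"An observed I.Q. of 90 -> roughly {low:.0f} to {high:.0f}")
```

With these assumed values the standard error of measurement is about 5 points, and a band of two standard errors on either side of 90 runs from roughly 80 to 100. A less reliable test would widen the band.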


INTELLIGENCE TESTS AND THE GUIDANCE STAFF

Intelligence tests have their most obvious function in the
educational program as sources of information important to persons
responsible for counseling and helping the child with problems of
personal and social adjustment, making provisions for special
educational activities for him, helping him to decide on appropriate
educational objectives, and working with him to formulate vocational
plans. In plans and decisions of all these types, it is important to
have a clear picture of the pupil's intellectual abilities as one aspect
of the total picture of the pupil as an individual.

In educational guidance information about scholastic aptitude is
especially important. This information should receive very serious
consideration in deciding what is an appropriate educational
objective for the pupil; i.e., whether to plan for college and if so the kind
of college to plan for, or what type of high school curriculum to
select. In vocational counseling, more specialized ability measures,
of the kinds we shall consider in the next chapter, are desirable as a
supplement to the general intelligence test, but these specialized tests
are not so important for educational planning. For understanding
a child who is having problems in school, whether with his school
work or his personal adjustments, an estimate of his intellectual level
is essential. As we have indicated elsewhere, individual tests and
non-language tests are highly desirable supplements to the usual
group test when any reading or language handicap is suspected.
The specific situations and circumstances under which intelligence
tests may be used in guidance are so many and varied that they
cannot each be discussed here. Some further consideration is given
to tests in the guidance program in Chapter 18.


SUMMARY STATEMENT 


Tests of ability include tests of achievement and of aptitude. 
Though aptitude tests usually depend less directly upon specific 
teaching than do achievement tests, it must be recognized that any
test performance is in some degree a function of the individual's 
background of experience. Aptitude tests are identified at least in 
part by their function—to predict future accomplishments. 

Among the most thoroughly explored and widely used aptitude
tests are tests of intelligence. As these have been developed, they
tend to emphasize abstract intelligence, the ability to deal with ideas
and symbols, and may even be thought of as scholastic aptitude tests.

The two main patterns of tests have been group tests and
individual tests. Group tests, resembling the short-answer achievement
test in format, are much more economical to use and are satisfactory
for many purposes when the examinees are normal groups of school
age or older. However, the individual tests have a number of
advantages and are useful particularly with (1) young children,
(2) emotionally disturbed cases, and (3) cases with special educational
disabilities.

Special tests have been developed for infant and preschool groups,
for groups with educational and language handicaps, and for groups
from different cultures and social groups. These may be of practical
value in special cases, though they serve more often as research tools.

Intelligence test results for school-age children are about as reliable
as any of our psychological measurement tools. The widely used
individual tests such as the Revised Stanford-Binet and the
Wechsler-Bellevue are probably somewhat more reliable than the typical group
test, though the differences are not large. In spite of the high
reliability, appreciable shifts may be expected from one testing to
another.

When intelligence test scores are studied in relation to achievement
in the world, the most clear-cut relationships are to academic
achievement. However, it is also true that there are substantial differences
in test performance by persons in different types of jobs.
Furthermore, success in at least some types of jobs is related to the abstract
intelligence measured by our tests.

Group differences in intelligence (i.e., sex, race, age differences)
must be interpreted quite tentatively, in view of the differences in
background for these different groups. However, individual
differences in intelligence are important facts, which we need to use wisely
in helping individuals in their adjustment to the world of the school
and of work.

REFERENCES 

E., The limitations of infant and preschool tests in the 

J. Psychol., 1939, 8, 351-379. 

of performance tests, 2nd ed., New York, 


1. Anderson, J. 
measurement of intelligence. 
2. Arthur, Grace, 4 point scale 


Commonwealth Fund, 1943. B | e $ 
3. Bayley, Nancy, Consistency and variability in the growth of intelligence 


from birth to eighteen years, J. genel. Psychol., 1949, 255 165-196. ) | 
4. Clark W. W. Questions and answers regarding the California Test of 
Mental. Maturity, los Angeles. California Test Bureau, 1948. е " 
5. Cornell Ethel 1... Effects of ability grouping determinable from published 
studies, in The grouping ol pupils, Vat. Soc. Study Educ., 35th Yrbk., Pt. Т, 


C 9 

на F.. М. Aborn, and A. Н. Canter, The reliability of the Wechs- 

ler- Bellevue subtes ind scales, J. consult. Psychol., 1950, JJ. 172-179. 
7. Durost, W. N., and Ci A. Prescott, An improved method of comparing 

a capacity measure and an achievement measure at the elementary school 

level, Educ. Psychol. Meas., 1952. 12, 741-751. 


242 TESTS OF INTELLIGENCE 


8. Durrell, D. D., The influence of reading ability on intelligence measures, 
J. educ. Psychol., 1933, 24, 412-416. 

9. Ebert, E., and Katherine Simmons, The Brush Foundation study of 
child growth and development, I, Psychometric tests, Monogr. Soc. Res. Child 
Develpm., 1943, 8, No. 2. э 

10. Franzblau, К. N., Race differences in mental and physical traits, Arch. 
Psychol., 1935, No. 177. 

11. Gesell, A., et al., The first five years of life: A guide to the study of the 
pre-school child, New York, Harper, 1940. 

12. Ghiselli, E. E., and C. W. Brown, The effectiveness of intelligence tests 
in the selection of workers, J. appl. Psychol., 1943, 32, 575-580. 

13. Goodenough, Florence L., Measurement of intelligence by drawings, 
Yonkers, N. Y., World Book, 1926. 

14. Goodenough, Florence L., and Katherine M. Maurer, The mental 
growth of children from two to fourteen years; a study of the predictive value 
of the Minnesota Preschool Scales, Univ. Minn. Inst. Child Welf. Monogr.. 
1942, No. 19. 

15. Goodenough, Florence L., Katherine M. Maurer, and M. J. Van Wag- 
enen, Minnesota Preschool Scales: Manual of instructions, Minneapolis, Minn., 
Educational Test Bureau, 1940. 

16. Honzik, Marjorie P., Jean W. McFarlane, and Lucille Allen, The sta- 
bility of mental test performance between two and eighteen years, J. exp. 
Educ., 1948, 17, 309-324. 

17. Justman, J.. A comparison of the functioning of intellectually gifted 
children enrolled in special progress classes in the junior high school, unpub- 
lished doctor's dissertation, Columbia University, 1953. 

18. Klineberg, O., Negro intelligence and selective migration, New York, 
Columbia University Press, 1935 

19. National Education Association, Research Division, Trends in city 
school organization, 1938 to 1948, Res. Bull., 1949, 27, 4-39. 

20. Rulon, P. J., A semantic test of intelligence, in Proceedings of the 1952 
Invitational Conference on Testing Problems, Princeton, N. J., Educational 
Testing Service, 1953. 

21. St. John, C. W., Educational achievement in relation to intelligence 
as shown by teachers' marks, promotions and scores in standard tests in cer- 
tain elementary grades, Harvard. Univ. Stud. Educ., 1930, Vol. 15. 

22. Skodak, Marie, Children in foster homes: А study of mental develop- 
ment, Univ. Ja. Stud. Child Welf., Vol. 16, No. 1, 1939. 

23. Stewart, Naomi, A.G.C.T. scores of army personnel grouped by occu- 
pations, Occupations, 1947, 26, 5-41. 

24. Stutsman, Rachel, Mental measurement of pre-school children, with a 
guide for the administration of the Merrill-Palmer Scale of Mental Tests, Yonkers, 
N. Y., World Book, 1931. 

25. Terman, L. M., and Maude A. Merrill, Measuring intelligence: a guide 
to the administration of the new revised Stanford-Binet Tests of Intelligence, 
New York, Houghton Mifflin, 1937. 

26. Thorndike, R. L., The prediction of intelligence at college entrance 
from earlier tests, J. educ. Psychol., 1947, 38, 129-148. 

27. Tuddenham, R. D., Soldier intelligence in World Wars I and II, Amer. 
Psychologist, 1948, 3, 54-56. 

28. Wechsler, D., The measurement of adult intelligence, Baltimore, Williams 
& Wilkins, 1944. 

29. Wechsler, D., Wechsler Intelligence Scale for Children: Manual, New 
York, Psychological Corp., 1949. 

30. Wheeler, L. R., A comparative study of the intelligence of east 
Tennessee mountain children, J. educ. Psychol., 1942, 33, 321-334. 


SUGGESTED ADDITIONAL READING 


Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chap- 
ters 8-12. 

Eells, Kenneth Walter, et al., Intelligence and cultural differences, Chicago, 
University of Chicago Press, 1951. 

Goodenough, Florence L., Mental testing, New York, Rinehart, 1949, 
Chapters 2, 20-22, and 29. 

Monroe, Walter S., Editor, Encyclopedia of educational research, New York, 
Macmillan, 1950, pp. 600-612, 874-894. 

Stoddard, George D., The meaning of intelligence, New York, Macmillan, 
1943, Chapters 1, 11, 12, 15, and 16. 


QUESTIONS FOR DISCUSSION 


1. It has been proposed that all intelligence tests should really be called 
scholastic aptitude tests. What are the merits and the limitations of this 
proposal? 

2. Why is it better to depend upon a good intelligence test for an estimate 
of a pupil's intelligence than upon ratings by teachers? 

3. In each of the following situations would you elect to use a group in- 
telligence test or an individual intelligence test? Why? 

a. You are studying a boy with a serious speech impediment. 

b. You are selecting students for a school of nursing. 

c. You are preparing to counsel a high-school senior on his educational and 
vocational plans. 

d. You are making a study of the Mexican children in a school system in 
Arizona. 

e. You are working with a group of delinquents in a state institution. 

4. In which of the following situations would you choose routinely to use 
the Arthur Point Scale instead of the Stanford-Binet? Why did you decide 
as you did? 

a. For testing Puerto Rican children entering school in New York City. 

b. For selecting children for a special class of gifted children. 

c. For evaluating intelligence in a school for the deaf. 

d. For studying children who have reading problems. 

5. What are the implications for child placement agencies of the data on 
infant tests presented on p. 219? 

6. Why do two different intelligence tests given to the same pupil quite 
frequently give two different I.Q.'s? 




7. Are the usual group intelligence tests more useful for guidance for 
professional occupations or for skilled occupations? Why? 

8. A news article reported that a young woman who had been committed 
to a mental hospital with an I.Q. of 62 had been able to raise her I.Q. to 118 
during the 3 years she had spent there. What is misleading about this news 
statement? What factors could account for the difference between the two 
I.Q.'s? 

9. What advantages do intelligence tests have over high-school grades as 
predictors of college success? What limitations do they have? 

10. Why do intelligence tests show higher correlations with standardized 
achievement tests than they do with school grades? 

11. Comment on the statement: “College admissions officers should dis- 
count scholastic aptitude test scores of applicants who come from low socio- 
economic groups." 

12. You are a fourth-grade teacher. You have given a group intelligence 
test to your class and gotten I.Q.'s from it. What sorts of specific actions and 
plans might you base on the results? 

13. In a child guidance clinic, the mental tester who gave the 
Stanford-Binet made a practice of always computing I.Q.'s to one decimal place 
(i.e., 95.6). What advantages and what dangers do you see in this procedure? 

14. A school in a prosperous community gave Stanford-Binet intelligence 
tests to all entering kindergartners and first graders within the first week or 
two of school. How desirable a procedure is this? Why? 


Chapter 10 


The Measurement of Special 
Aptitudes 


It was not long after the development of tests of general intelligence 
that psychologists became interested in the development of tests of more 
specialized abilities. In part, the movement for the development of more 
specialized tests grew out of practical studies of jobs and attempts to 
predict success in jobs. Thus, analysis of the work of an automobile 
mechanic, for example, indicates that the demands of the job are not 
entirely, or perhaps even primarily, in the domain of abstract symbolic 
intelligence, but that understanding of mechanical devices, the ability to 
visualize the relative movements of parts, and other sorts of special 
abilities appear to be involved. So, spurred on by this practical concern, 
psychologists prepared various tests of mechanical comprehension, spatial 
visualizing, clerical speed and accuracy, and manual dexterity. 

A related concern was in the pupil's readiness to undertake certain 
aspects of the school program. As educators became aware of individual 
differences in their pupils and of the futility of pitting a pupil against 
work that was beyond his abilities, they began to ask which pupils were 
ready to start a formal reading program, which were suitable candidates for 
the algebra class, or which should take shorthand. Though general 
intelligence tests were of some value in answering these questions, further 
tests were developed that had as their specific purpose prognosis of 
readiness for, or probable success in, a particular part of the educational 
program. A similar concern for picking the most promising candidates for 
advanced professional education has led to the development of entrance test 
batteries tailored for particular professional schools. 

At the same time, persons interested in the special fields of music 
and graphic art undertook to measure the special abilities necessary 
for success in those fields. It has long been recognized that musical 
and artistic talents are quite highly specialized, and it was believed 
that it should be possible to develop tests for at least some of the 
components of such talent. 

Paralleling practical interest in selecting persons for particular 
types of jobs or particular programs of training, there was a theoreti- 
cal interest in analyzing the various types of test tasks that had been 
developed as measures of intelligence. Studies based on the statis- 
tical procedures of factor analysis confirmed the practical interest 
in special ability tests by indicating that a fair degree of specializa- 
tion does in fact exist in intellectual abilities. The verbal tests cor- 
relate more highly with each other than they do with the quanti- 
tative tests, and vice versa. Verbal and number tests are somewhat 
distinct from those that use spatial materials. Memory tests are 
different from reasoning tests. A number of factors appear to be 
distinguishable in the intellectual domain. 
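
The pattern just described can be made concrete with a small computation. The sketch below, in Python, takes a table of correlations among four tests and compares the average correlation within a type of test with the average correlation between types. The test names and every coefficient are invented for illustration; they are not taken from any study cited in this chapter, and the comparison is only a first look at clustering, not a factor analysis.

```python
from itertools import combinations

# Hypothetical tests and their types (illustrative only).
tests = ["vocabulary", "reading", "arithmetic", "number_series"]
kind = {
    "vocabulary": "verbal",
    "reading": "verbal",
    "arithmetic": "quantitative",
    "number_series": "quantitative",
}

# Hypothetical correlations between pairs of tests (invented values).
r = {
    ("vocabulary", "reading"): 0.70,
    ("arithmetic", "number_series"): 0.65,
    ("vocabulary", "arithmetic"): 0.40,
    ("vocabulary", "number_series"): 0.35,
    ("reading", "arithmetic"): 0.45,
    ("reading", "number_series"): 0.40,
}


def mean_correlations(tests, kind, r):
    """Return (mean within-type correlation, mean between-type correlation)."""
    within, between = [], []
    for a, b in combinations(tests, 2):
        value = r[(a, b)] if (a, b) in r else r[(b, a)]
        if kind[a] == kind[b]:
            within.append(value)
        else:
            between.append(value)
    return sum(within) / len(within), sum(between) / len(between)


within_mean, between_mean = mean_correlations(tests, kind, r)
print(f"within types: {within_mean:.3f}, between types: {between_mean:.3f}")
```

When the within-type average clearly exceeds the between-type average, as it does for these made-up figures, the tests are behaving as two partly distinct groups rather than as interchangeable measures of one ability.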

In this chapter we will consider first some of the more specialized 
tests and test batteries developed for employee selection and voca- 
tional guidance. Then we will consider prognostic tests and pro- 
fessional-school aptitude batteries. Finally, we will examine some of 
the attempts to appraise special aptitudes in music and art. 


VOCATIONAL TESTS 


Quite early in the history of standardized testing, attempts were 
made to apply tests to the practical problems of selecting men for 
employment in particular jobs. Since each job is to a degree unique 
in the duties that it includes and consequently in the abilities it 
appears to require, a testing program needs to be reviewed for the 
specific job. The sequence of operations by which a job selection 
battery is set up is outlined below. 

SETTING UP A SELECTION TESTING PROGRAM 


Building a testing program for a job involves the following sequence 
of steps: 


1. Analysis of the Job to Determine the Aptitudes Required. In or- 
der to get initial “hunches” as to what sorts of tests are likely to be 
valid predictors of job success, the test maker must get acquainted 
with the job. Job analysis proceeds in various ways—by observing 
men working on the job; by talking to workers and discussing what 
they do, how they do it, and what difficulties they encounter in 
doing it; by talking to supervisors; and by actually trying to learn 
the job. From these various sources the personnel psychologist gen- 
erates hypotheses as to abilities for which tests should be developed. 




2. Development of Tests to Measure the Significant Abilities. Ini- 
tially, many of the tests for vocational selection were local products, 
developed to meet the needs of a particular selection situation. A 
number of these have proved of sufficiently general interest so that 
they have survived and become part of the pool of commercially 
available tests. At the present time, the personnel psychologist will 
frequently find a commercially distributed test that appears to fill 
his need. In other instances, he may still need to construct tests 
that appear to be particularly called for by the job with which he 
is working. 

3. Administration of the Tests to a Sample of Job Applicants. The 
tests are still of unknown value. Their value must be established 
by tryout. Ideally, they should be tried out on job applicants like 
those with whom they will later be used. As a practical compromise, 
tests are sometimes tried out on men who are already working on 
the job. In this case, however, it is hard to know how much the 
effects of job experience or differences in motivation may distort 
test performance. The man who does well on the tests after he has 
had years of job experience may not be the same one who would 
have done well before entering upon the job. 

4. Determination of Success on the Job for Each Examinee. Some 
criterion measure of job success must be obtained for each person 
tested. This sounds fairly simple and straightforward. However, 
it is probably the most troublesome part of personnel selection re- 
search. To get reliable, unbiased, relevant measures of each man's 
success on the job usually proves to be a very imposing undertaking. 
For some jobs actual performance records may be available. How- 
ever, these will often be distorted by factors outside of the employee's 
control. Thus, sales of soap depend upon the sootiness of the 
atmosphere as much as on the skill of the salesman. More often, it 
is necessary to fall back upon ratings by supervisors. We will 
consider the problem of ratings in some detail in Chapter 13. At present, 
we need only indicate that they are a weak reed on which to have 
to lean. 

5. Analysis of the Correlation of Test Scores with the Criterion Measure. 
Once we have gotten the best criterion measure that our ingenuity 
and the practical situation will permit, we must correlate 
each predictor test with the criterion. The validity coefficient of 
each test taken singly will indicate the value of this test by itself. 
However, more elaborate analyses, which depend upon the 
intercorrelations of the tests as well as the validity of each, are necessary 
if we are to pick the best team of tests and put them together in the 
most effective way. From this analysis we determine an appropriate 
weighting for each test (regression weight) and also the validity of 
the composite of tests (multiple correlation). Those tests that 
give the most promise can be retained for later operational use. 
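
For the simplest case of two tests, the weighting and the composite validity mentioned in step 5 can be worked out directly from three correlations. The following Python sketch applies the standard two-predictor formulas for standardized regression weights and the multiple correlation. The function name and the three input values are hypothetical, chosen only to show the arithmetic; a real battery would usually involve more tests and a matrix solution.

```python
import math


def two_predictor_composite(r1y, r2y, r12):
    """Standardized regression weights and multiple correlation for two tests.

    r1y, r2y: validity coefficients of test 1 and test 2 against the criterion.
    r12: intercorrelation of the two tests.
    """
    denom = 1.0 - r12 ** 2
    beta1 = (r1y - r12 * r2y) / denom
    beta2 = (r2y - r12 * r1y) / denom
    multiple_r = math.sqrt(beta1 * r1y + beta2 * r2y)
    return beta1, beta2, multiple_r


# Hypothetical values: validities of 0.40 and 0.30, intercorrelation of 0.50.
beta1, beta2, multiple_r = two_predictor_composite(0.40, 0.30, 0.50)
print(f"weights: {beta1:.3f}, {beta2:.3f}; multiple correlation: {multiple_r:.3f}")
```

With these made-up figures the composite correlation (about 0.42) is only slightly above the validity of the better test alone (0.40), because the two tests overlap a good deal with each other. This is why the intercorrelations, and not merely the separate validities, enter into the choice of a team of tests.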


COMMERCIALLY DISTRIBUTED SELECTION TESTS 


In the course of time, research relating to particular jobs or types 
of jobs has yielded an assortment of tests having enough general 
interest to merit publication and general distribution. We shall 
consider briefly types of tests relating to mechanical aptitude and to 
clerical aptitude. 

Mechanical Aptitude Tests. Many varieties of tests have been 
developed that are designed to pick men for, or guide youngsters with 
respect to, entering mechanical jobs. One of the early developments 
in this area was a type of mechanical assembly test in which a 
number of common "gadgets" are presented in pieces and in which score 
is based on the speed and correctness with which they are put 
together. Tests measuring knowledge about tools, their names and 
what they are used for, represent another type of test in the 
mechanical area. Other tests have been based more on understanding of 
simple mechanical principles. These present simple situations 
involving levers, pulleys, siphons, and common elementary physical 
principles, usually in the form of pictures, and test the examinee's 
knowledge by asking about relationships or calling for prediction of 
outcomes. Thus, a car may be shown going around a curve, and the 
examinee is asked to which side it would be likely to skid, or a 
diagram of a dam is shown, and the examinee indicates at which of 
two points on the dam the water pressure is greater. 

Spatial and visualizing abilities appear related to performance in 
some mechanical jobs, and several types of tests of these functions 
have been developed. The items illustrated in Fig. 10.1 call for 
the ability to see how a whole figure can be 
assembled from its parts (I), how a three-dimensional figure is 
related to a two-dimensional pattern (II), how an object will appear 
when seen from a different point of view (III), or how movements 
of one part are related to movements of another (IV). Within the 
area of spatial and visualizing ability, several more or less distinct 
abilities appear exemplified by these different sorts of tests. 

Fig. 10.1. Types of spatial and visualizing tests. [Figure not reproduced. 
Directions for the four item types: 
I. Which set of pieces corresponds to the figure at the far left? 
II. Which flat pattern could be folded to make the object shown at the 
far left? 
III. Each side of the box has a different pattern. Which lettered box can 
be the same as the one at the far left? 
IV. When I moves in direction X, does II move toward A or B? (Ans. B) 
The diagram marks stationary pivots and moveable connections.] 

A further large group of tests that are related to tests of mechanical 
ability are tests of motor coordination and manual dexterity. 
There are many specific tests in this field, tests of speed of tapping, 
tests of finger dexterity in fine movements, tests of rate of 
manipulation, tests of two-hand coordination, and tests of eye-hand 
coordination of gross movements. Study of the various motor tests indicates 
that the correlations among them are quite low, lower than those 
that characterize different types of intellectual ability. There does 
not appear to be any general factor of motor skill but rather many 
different specific skills. In selection research, it becomes necessary 
to determine the specific motor skills, if any, that are important on 
the job. Thus, shrewd job analysis and extensive tryout are often 
necessary before a motor test is found that shows validity. Generally 
speaking, motor tests have shown less promise in personnel selection 
than have the tests of more intellectual functions. 

Mechanical aptitude tests published up to about 1942 are reviewed 
by Bennett and Cruickshank, and these same authors have prepared 
a review of clerical tests. 


Clerical Aptitude Tests. A number of tests for clerical workers 
have been produced. Upon closer scrutiny, however, most of them 
are found to consist in large part of tasks of the same sort as those 
included in the typical intelligence test. As noted in Table 9.5 (p. 
233), clerical jobs are ones for which intelligence tests have been 
found to have relatively high validity, so it is not surprising that 
the same tasks appear under the guise of "clerical aptitude tests." 

The additional component that seems to enter into a number of 
clerical jobs is speed and accuracy of perceiving clerical detail. 
Probably the best-known test of this function is the Minnesota Vo- 
cational Test for Clerical Workers. This test has two parts, name 
checking and number checking. In one the subject has a long col- 
umn consisting of pairs of names; in the other, a column of pairs 
of numbers. He is to check the two members of each pair to see 
whether they are exactly the same or whether there is some differ- 
ence between them. The test is given with a very short time limit 
and is scored for speed and accuracy. The types of items are illus- 
trated below: 


Number Checking: 8637 8673 
25946 25946 
73987 73897 


Name Checking: Walter C. Jones Walter G 
The Carstairs Coal Co. The Carstairs Coal Co., Inc. 
Howard D. Van Deusen Howard D. Van Deusen 


This test has shown promising validity for a number of clerical jobs. 
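
The checking task lends itself to a short illustration of how such items might be scored. The Python sketch below uses the complete pairs shown above and counts an answer as right when the examinee's "same" or "different" judgment agrees with an exact comparison of the two entries. The scoring rule (number right, with no credit for items not reached) and the examinee's answers are assumptions made for the example; the text does not give the published scoring formula.

```python
def score_checking(items, responses):
    """Count correct "same"/"different" judgments.

    items: list of (left, right) strings shown to the examinee.
    responses: list of booleans, True where the pair was marked "same".
    Items beyond len(responses) were not reached and earn no credit.
    """
    correct = 0
    for (left, right), marked_same in zip(items, responses):
        if marked_same == (left == right):
            correct += 1
    return correct


items = [
    ("8637", "8673"),
    ("25946", "25946"),
    ("73987", "73897"),
    ("The Carstairs Coal Co.", "The Carstairs Coal Co., Inc."),
    ("Howard D. Van Deusen", "Howard D. Van Deusen"),
]

# Hypothetical examinee: reaches four items before time is called
# and wrongly marks the third pair as "same".
responses = [False, True, True, False]
score = score_checking(items, responses)
print(score)  # 3
```

Because the time limit is very short, a score of this kind reflects both how many items were reached and how accurately they were judged, which is the combination of speed and accuracy the test is meant to tap.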


APTITUDE TEST BATTERIES 


In the beginning, tests for use in personnel selection and voca- 
tional counseling were developed as separate unrelated undertakings 
—a mechanical knowledge test here, a manual dexterity test there, 
a test of spatial visualizing somewhere else. But these separate 
and unrelated tests were somewhat unsatisfactory to use. On the 
one hand, it was difficult to assemble a well-rounded set of tests to 
cover the range of abilities significant for a program of guidance 
or personnel classification. On the other, the norms for different 
tests were not based on the same sorts of groups, so it was not pos- 
sible to treat norms from the different tests as equivalent. 

In response to these two problems, testing agencies have in recent 
years developed integrated guidance and classification batteries. 
Several such batteries were developed by the Armed Forces during and 
after World War II to aid in assigning recruits to the many different 




types of specialized training and the many specific assignments that 
exist in a military organization. There they made a substantial con- 
tribution to efficient use of military manpower. Others have been 
developed for civilian use. This type of battery will be illustrated 
here by the Differential Aptitude Test Battery (DAT) published by 
the Psychological Corporation as a guidance battery for use with 
secondary-school youngsters. 

The Differential Aptitude Test Battery. The DAT is made up of 
eight subtests yielding separate scores. The tests are identified by 
title, and the nature of each is made clear by an illustrative item in 
Fig. 10.2. The tests are not independent. With the exception of 
the test for clerical speed and accuracy, the intercorrelations run 
about 0.50. But the reliabilities of the separate tests are enough 
higher than the test intercorrelations to assure us that each test 
measures abilities somewhat distinct from those measured by the 
others. 
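
The comparison of reliabilities with intercorrelations can be illustrated with the familiar correction for attenuation, which estimates what the correlation between two tests would be if both were perfectly reliable. The Python sketch below is only an illustration: the intercorrelation of 0.50 is the typical figure quoted above, but the reliabilities of 0.90 are assumed for the example and are not taken from the DAT manual.

```python
import math


def true_score_correlation(r_xy, r_xx, r_yy):
    """Correlation corrected for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Observed intercorrelation of 0.50; reliabilities of 0.90 are assumed values.
corrected = true_score_correlation(0.50, 0.90, 0.90)
print(f"{corrected:.3f}")  # 0.556
```

A corrected value far below 1.00, as here, indicates that the two tests would still rank people differently even if errors of measurement were removed. If the reliabilities were no higher than the intercorrelation, the corrected value would approach 1.00 and the two scores could not be treated as separate abilities.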

The extent to which the different abilities are significant for 
different aspects of educational success is suggested by Table 10.1. 


Table 10.1. Median Correlation of Differential Aptitude Test Scores with 
School Grades in Different Subjects * 

                                                            Soc. Stud.,  Lan-                 Short- 
Test                              English    Math.    Science   History   guages    Typing    hand 
Verbal Reasoning (VR)             50 (2) †   39 (2)   54 (1)    50 (1)    30 (4)    19 (6)    44 (3) 
Numerical Ability (NA)            48 (3)     50 (1)   51 (2)    48 (2)    42 (1)    32 (1)    27 (4) 
Abstract Reasoning (AR)           ?          ?        ?         38 (?)    ?         27 (3)    24 (5) 
Space Relations (SR)              27 (6)     32 (5)   36 (6.5)  26 (6.5)  15 (8)    16 (7)    16 (6) 
Mechanical Reasoning (MR)         24 (7.5)   22 (7)   38 (5)    24 (8)    17 (7)    14 (8)    14 (7.5) 
Clerical Speed & Accuracy (CSA)   24 (7.5)   19 (8)   20 (8)    26 (6.5)  23 (6)    26 (4.5)  14 (7.5) 
Spelling (Spell.)                 ?          20 (6)   36 (6.5)  30 (4)    31 (3)    26 (4.5)  55 (1) 
Sentences (Sent.)                 52 (1)     36 (3)   48 (3)    46 (3)    40 (2)    30 (2)    49 (2) 

* The entries in this table are correlation coefficients. Decimal points 
have been dropped to save space. Thus 50 should be understood to mean 
0.50, etc. 
† Number in parentheses shows rank of that test for that subject. 
? Entry illegible. 


This table, based on data presented in the DAT manual, shows the 
median correlation between tests of the Differential Aptitude Test 
Battery and success in different secondary school courses. The ranks of 
the different tests in the battery are also shown for each subject. 

Fig. 10.2. Sample items from Differential Aptitude Tests. [Figure not 
reproduced: one sample item each for verbal reasoning, numerical ability, 
abstract reasoning, space relations, mechanical reasoning, clerical speed 
and accuracy, and language usage (spelling and sentences).] 

The tests showing the highest correlations tend to be a good deal 
the same for the different school subjects. Thus, the Verbal Reasoning 
Test ranks near the top for all subjects but typing. We see that 
Numerical Ability is one of the better tests in each instance. The 
Space Relations Test is one of the less valuable for each of the 
academic areas. However, variations can also be noted. The Mechanical 
Reasoning Test is more important in science than in the other 
areas. The Spelling Test comes into its own in predicting success in 
shorthand. The Numerical Ability Test is more valid for mathematics 
than it is for English. The Clerical Speed and Accuracy Test is of 
value only for typing. The specific tests do have a certain amount 
of differential significance. 

Evidence on the predictive significance of the different ability tests 
for later vocational careers is harder to come by. A follow-up of 
high-school students tested with the DAT was undertaken about 2 
years after they had completed high school. The percentile equivalents 
of the average man in each group that contained 20 or more 
cases are shown in Table 10.2. 

Table 10.2. Percentile Equivalents of Average Scores on the DAT for 
Men in Various Educational and Occupational Groups 

                                                        Percentiles 
Group                                          No.   VR   NA   AR   SR   MR  CSA  Spell.  Sent. 

Degree-Seeking Students 
  Premedical                                    24   88   86   81   72   77   77    90     90 
  Science (Biology, Chemistry, and Math.)       25   81   85   60   67   68   74    81     72 
  Engineering (includes Architectural)          70   80   86   80   81   82   67    68     74 
  Liberal Arts (includes Prelaw)                68   79   75   78   61   64   75    78     81 
  Business Administration                       64   72   73   61   63   60   71    67     68 
  Education (includes Physical Education)       25   68   66   58   57   48   68    67     66 
  Various: Predental, Agricultural, etc.        30   64   67   74   60   73   59    55     64 
Non-Degree Students in Two-Year Schools 
  Business, Technical, Fine Arts, etc.          43    ?    ?    ?    ?    ?    ?     ?      ? 
Employed 
  Salesmen                                      23   56   53   53   32   57   44    50     40 
  Clerks: General Office Work                    ?    ?    ?    ?    ?    ?    ?     ?      ? 
  Mechanical, Electrical, and Building Trades   66   34   41   46   49   56   44    28     20 
  Various Skilled: Butcher, Baker, etc.         26   47   37   43   41   4?   51    52     45 
  Various Unskilled: Truck Driver, Laborer, etc. 85  35   30   36   42   49   37    35     36 
  Military Service                             129   46   42   46   51   50   40    46     41 
Unclassified 
  No Consistent Work or School Record           58   53   48   51   54   54   46    47     52 

? Entry illegible. 
Reprinted by permission of the Psychological Corporation. 

Thus, premedical school students were high on all tests, but highest by 
a small margin on the verbal tests. Education students were somewhat above 
the average of the norm group on all except the mechanical test. Workers in 
electrical, mechanical, and building trades were above average only in the 
mechanical test. Many other differences may be noted in the table. 




It must be remembered that the groups are small, that variation 
within an occupation is large, and that these data do not indicate 
how successful the person is in the occupation. However, they pro- 
vide some cue as to the predictive significance of the tests. 
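
To show what a "percentile equivalent of the average man" involves, the following Python sketch converts the mean score of a small occupational group into a percentile rank within a norm group. All scores are invented, and the rule for tied scores (counting half of the ties as falling below) is one common convention assumed here; the text does not say how the published percentiles were computed.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the norm group below the score, counting half of any ties."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Invented raw scores for a norm group and for one occupational group.
norm_scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 38]
group_scores = [27, 30, 33]

group_mean = sum(group_scores) / len(group_scores)
pr = percentile_rank(group_mean, norm_scores)
print(pr)  # 75.0
```

The result describes only the average member of the group. Individual members of this invented group fall on either side of that percentile, which is the point of the caution above about the large variation within an occupation.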


OTHER APTITUDE TEST BATTERIES 
A number of other aptitude test batteries are available, designed 
to be used for guidance and placement. These will be listed and 
commented on briefly below. 

General Aptitude Test Battery (GATB). This battery of 15 tests 
was prepared by the U. S. Employment Service. Its use is limited 
to State Employment Services; the battery is not made available 
to the public. Scores from the separate subtests are combined into 
factor scores designated intelligence (G), verbal aptitude (V), 
numerical aptitude (N), spatial aptitude (S), form perception (P), clerical 
perception (Q), aiming (A), motor speed (T), finger dexterity (F), 
and manual dexterity (M). Recommended minimum scores are 
given for many different occupations. Evidence on validity was 
obtained by testing groups of persons already working in the 
occupation. The groups are usually small, consisting of between 30 and 
100 cases. 

Flanagan Aptitude Classification Tests (FACT). These tests were 
published in 1953 by Science Research Associates. The battery 
consists of 14 separate tests covering a variety of rather independent 
verbal, quantitative, perceptual, memory, and motor abilities. 
Weighting procedures are indicated for combining the subtests to 
yield predictions of success for some 30 occupational groups. 
However, little evidence of the validity of the specific tests is offered. 
The basis for the recommendations appears to be largely the author's 
judgment, supported by collateral evidence from other tests bearing 
some resemblance to his. Most psychologists would take exception 
to the sweeping interpretations based on such limited and indirect 
evidence. 

Factored Aptitude Series. This battery, produced by Joseph King, 
Industrial Psychology, Inc., between 1947 and 1950, also undertakes 
to measure the various ability factors identified in previous test 
research and to apply the results to industrial selection. The tests 
are attractive and well planned and may have validity for some 
purposes. However, the validity of the tests tends to be assumed by 
the author rather than demonstrated on the basis of empirical 
evidence. 




Guilford-Zimmerman Aptitude Survey. The seven parts of this bat- 
tery, brought out by the Sheridan Supply Company between 1947 
and 1950, cover presumably distinct abilities in the areas of abstract 
intelligence, clerical aptitude, and mechanical aptitude. Other tests 
were originally planned and may be published subsequently. As with 
the batteries just discussed, little empirical evidence of validity is 
provided, though claims for validity are perhaps more modest for 
this battery. The tests appear workmanlike, but their value remains 
to be demonstrated. 


THE PROBLEM OF VALIDATION AGAINST SUBSEQUENT JOB SUCCESS 

For all the aptitude batteries that we have reviewed we have 
found a dearth of empirical evidence on the validity of the tests as 
predictors of later job success. This is not too surprising. Follow-up 
of individuals who have been tested to see which ones subsequently 
go into what sorts of occupations and which ones are successful in 
each occupation is a laborious and time-consuming operation. It 
requires a wide-ranging program of testing, elaborate arrangements 
for follow-up, and the lapse of a number of years. It is natural that 
no studies of this sort have been carried through on a scale and in a 
manner to provide convincing evidence on the significance of specific 
abilities for success in specific occupations. 

Some limited data * from a 10-year follow-up of a sample of 1500 
men tested as applicants for air-crew training in the Army Air Force 
in World War II are shown in Fig. 10.3 to illustrate the sort of evi- 
dence that would be desirable on a much larger scale. The men 
were tested in 1943, and the occupations are those in which they 
were engaged in 1953. The bars on the chart show level of perform- 
ance in comparison with the total group of applicants, a group about 
comparable to entering college freshmen. 
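
The scale used for the bars in Fig. 10.3 is a standard score on which the total applicant group averages 0 with a standard deviation of 100. The Python sketch below shows the conversion under simple assumptions: the raw scores are invented, and the use of the population standard deviation of the applicant group is a choice made for the example, not a description of the original Air Force procedure.

```python
import statistics


def cadet_style_standard_score(raw, applicant_scores):
    """Express a raw score on a scale with mean 0 and standard deviation 100."""
    mean = statistics.mean(applicant_scores)
    sd = statistics.pstdev(applicant_scores)
    return 100.0 * (raw - mean) / sd


# Invented raw scores for the applicant group (mean 50, standard deviation 10).
applicants = [40, 40, 60, 60]

# An occupational group whose average raw score is 55 plots at +50.
print(cadet_style_standard_score(55, applicants))  # 50.0
print(cadet_style_standard_score(45, applicants))  # -50.0
```

On this scale a bar at +50 means that the occupational group averaged half a standard deviation above the applicants as a whole, and a bar at -50 half a standard deviation below.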

The high scores in reading comprehension shown by the engineers 
and lawyers reflect in part the demands imposed by their training. 
Engineers and lawyers as groups are high in ability to understand 
verbal material because nobody could get into those professions with- 
out a good deal of verbal ability. The profiles for lawyers and engi- 
neers differ somewhat, the engineers making a better showing on 
mathematics, mechanical comprehension, and the tests of coordina- 
tion, whereas the lawyers are more specifically verbal. However, 
the striking differences come when we compare either professional 
group with the telephone installers or with the protective service 
workers (policemen, firemen, etc.). These two latter groups are 
much lower on the more intellectual and academic measures. However, 
they show up well on the mechanical test and are above average 
in tests of motor coordination. The groups shown in Fig. 10.3 are 
four out of many that were identified in this follow-up study. 

There is no integrated picture of the value of tests for discriminating 
more successful from less successful workers on jobs. Studies 
of this sort have been carried out on a piecemeal basis by investigators 
scattered over the country. They have used a great range of specific 
tests in a wide variety of jobs, and the findings are hard to piece 
together into any meaningful picture. Ghiselli has attempted to 
summarize and integrate these many unrelated specific studies, pooling 
data on similar types of tests and similar types of jobs. The 
pooled results are presented in part in Table 10.3. 

Table 10.3. Validity of Different Categories of Tests for General 
Categories of Jobs 
(Pooled results from available studies from Ghiselli) 


Type of Job 


Protect. Skilled Semi- 
Clerical Service Trade skilled Unskilled 


General intelligence .36 .28 .45 .20 .16 
Arithmetic .42 =12 15 
Number comparison .28 .25 .15 .15 
Spatial relations .06 .45 .30 .27 
Mechanical principles .45 .25 
Finger dexterity .22 .20 .21 .30 .05 


This table is incomplete because ordinarily investigators tried out
only the kinds of tests that seemed to them appropriate for a particular
occupation. Thus, there were no data for mechanical comprehension
tests for clerical workers. However, the table does give some
notion of the level and degree of specialization of test validity
coefficients when the tests are evaluated against some measure of
on-the-job success. For most categories of job, correlations of 0.30
to 0.45 represent the highest average value that is obtained. For
clerical jobs, arithmetic tests give the highest correlation. For skilled
trades, general intelligence tests are joined by mechanical and spatial
tests. In semiskilled jobs, spatial relations is joined by finger dexterity.
No very high correlations appear for the protective workers
and unskilled categories, but general intelligence is highest in the
first case and spatial relations in the second.
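
A validity coefficient of the kind pooled in Table 10.3 is simply the correlation between test scores and some measure of job success. The following is a minimal sketch of that computation, not a reconstruction of any study summarized by Ghiselli. The function name and the scores and ratings are hypothetical and serve only as an illustration.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion):
    """Pearson correlation between test scores and a criterion measure."""
    n = len(test_scores)
    if n != len(criterion) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 cases")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    dx = [x - mean_x for x in test_scores]
    dy = [y - mean_y for y in criterion]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("both measures must show some variation")
    return sum_xy / sqrt(ss_x * ss_y)


# Hypothetical data: spatial test scores and supervisors' ratings
# of job success for eight workers (invented for illustration).
spatial_scores = [12, 15, 9, 20, 17, 11, 14, 18]
supervisor_ratings = [3, 4, 3, 5, 3, 2, 4, 4]

r = validity_coefficient(spatial_scores, supervisor_ratings)
print(f"validity coefficient r = {r:.2f}")
```

With real data the criterion would be whatever appraisal of success the investigator could obtain, and the crudeness of that appraisal limits the size of the coefficient.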


[Fig. 10.3. Average aviation cadet classification test scores of 1952
occupational groups. Four panels, one each for lawyers, engineers,
protective service workers, and telephone installers and linemen,
show the group's cadet standard score (average = 0, standard
deviation = 100) on each test: reading comprehension, arithmetic
reasoning, math knowledge, spatial visualizing, mechanical principles,
two-hand coordination, psychomotor coordination, and perceptual speed.]


These data are crude and suggestive at best. They are blurred
by (1) the grouping of tests, (2) the grouping of jobs, and (3) the 
problem of getting any satisfactory appraisal of success in a job. 
However, they give some rough picture of trends in aptitude test 
validities against measures of job success. 
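
The profiles in Fig. 10.3 are expressed as cadet standard scores, on a scale where the average is 0 and the standard deviation is 100. A minimal sketch of that conversion follows. The function name and the reference mean and standard deviation are hypothetical values chosen for illustration, not figures from the follow-up study.

```python
def to_cadet_standard_score(raw_score, ref_mean, ref_sd):
    """Express a raw score on a scale with average 0 and SD 100.

    ref_mean and ref_sd describe the reference group on the raw-score
    scale; the values used below are invented for illustration.
    """
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return 100 * (raw_score - ref_mean) / ref_sd


# Hypothetical reference group: mean 50, standard deviation 10.
for raw in (65, 50, 42):
    score = to_cadet_standard_score(raw, ref_mean=50, ref_sd=10)
    print(f"raw {raw} -> standard score {score:+.0f}")
```

On this scale a group average of +50 lies half a standard deviation above the average of all cadets tested.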


PROGNOSTIC TESTS 


One group of aptitude tests is made up of tests designed to predict
readiness to learn or probable degree of success in some specific
subject or segment of education. These are called prognostic tests.
A group of tests in this category that have been widely heralded and
have received considerable use are the "reading readiness" tests.


Table 10.4. Types of Tasks Included in Representative Reading
Readiness Tests

Lee- Metro- Murphy-
Type of Test Task Gates Clark politan Stevens Durrell

Oral vocabulary or directions, using pictures x x x
Rhyming or matching sounds x x
Visual matching of figures, letters, or words x x x x x
Visual perceiving of figures, letters, or words
("Which one is different?") x x
Learning words in a standard lesson x x
Ability to read letters and words x


These tests are designed to be used with children, usually shortly 
after their entry into the first grade, to give the school as accurate 
an indication as possible of the child's ability to progress in reading. 
They provide information the teacher can use in assembling work- 
ing groups within the class, in deciding upon the amount and type 
of prereading activities to provide, and in judging how soon to start 
a formal reading program. In some communities where kindergarten 
attendance is quite general, tests at the end of kindergarten are looked 
to as one basis for organizing first-grade groups for the following 
year. The sorts of tasks that appear in these tests may be seen from 
Table 10.4. 




The reader who compares the tasks in Table 10.4 with the sample 
intelligence test items shown on pp. 205-207 will be aware of a sub- 
stantial degree of similarity. In both, knowledge of word meanings 
appears. Both deal with recognition of sameness and differences, 
with analysis and classification. However, the reading readiness tests
tend to emphasize more exclusively the materials of reading, letters 
and words. They include the components or early stages of the 
reading task. The basic question now becomes: Does the special 
slant which is given in the reading readiness test result in increased 
validity? Is the special test an improvement over a measure of 
general or academic aptitude? This is the question that must be 
raised for any type of prognostic test or special aptitude test. 

Whether a reading readiness test provides a better guide to later
reading success than does a general intelligence test remains somewhat
unclear. One fairly extensive investigation⁴ indicates that
tests requiring pupils to perceive and match words, to complete a
story, and to select rhyming words gave better prediction of reading
achievement one, two, or three terms later than did Stanford-Binet
mental age. The validities reported for the Gates Reading Readiness
Test, developed on the basis of this research, have been about 0.70,
whereas Stanford-Binet M.A. showed a correlation of only 0.40 in
the original study. This would indicate that the test tasks closely
resembling the tasks faced by the beginning reader do have higher
predictive effectiveness.
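
One way to see what the difference between validities of 0.70 and 0.40 amounts to is the usual linear-regression result that the standard error of estimate equals the standard deviation of the criterion multiplied by the square root of 1 - r². The sketch below applies that general formula to the two coefficients quoted above. It is an illustration under that assumption, not a computation reported in the original study, and the function name is hypothetical.

```python
from math import sqrt


def relative_error_of_estimate(r):
    """Standard error of estimate as a fraction of the criterion SD.

    Uses the standard linear-regression relation sqrt(1 - r**2).
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return sqrt(1.0 - r * r)


for label, r in [("Gates Reading Readiness Test", 0.70),
                 ("Stanford-Binet M.A.", 0.40)]:
    fraction = relative_error_of_estimate(r)
    print(f"{label}: r = {r:.2f}, "
          f"error of estimate = {fraction:.2f} x criterion SD")
```

Under this relation a validity of 0.70 leaves errors of prediction roughly 0.71 as large as they would be with no test at all, and a validity of 0.40 leaves them roughly 0.92 as large.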


Another set of data⁷ indicates, by contrast, that the Pintner-Cunningham
Intelligence Test had a higher correlation with sixth-grade
reading achievement than did the Lee-Clark Reading Readiness Test.
However, these two sets of results need not be considered contradictory.
The reading readiness test undertakes to predict ability to
profit from reading instruction in the near future and is not used to
forecast ultimate level of reading achievement. It may well be that
it is more effective as an indicator of progress in reading within the
next few months, even though an intelligence test is a better indicator
of ultimate level of reading achievement.

Prognostic tests have been developed for various other subjects
and levels. However, these tests have not been very popular in
recent years. When one is dealing with the usual academic areas,
i.e., algebra, geometry, foreign languages, it is a question whether
special prognostic tests can improve the predictions based on measures
of general intelligence and previous academic achievement in
related areas enough to justify their use. The demonstration that
they can has not been sufficiently impressive to result in widespread
adoption of the tests.

Special prognostic tests seem likely to be more useful as predictors
of success in rather special types of academic tasks that have had no 
counterparts at earlier levels of school experience. Thus, the Turse 
Shorthand Aptitude Test, for which a correlation of 0.67 with later 
achievement in shorthand has been reported, should probably be con- 
sidered a useful supplement to other information about the pupil in 
evaluating probable success in shorthand training. The ERC Steno- 
graphic Aptitude Test and the Bennett Stenographic Aptitude Tests 
have given comparable results. The tests include such tasks as spell- 
ing, transcribing symbols, dictation under speed pressure, and word 
discrimination. 


PROFESSIONAL-SCHOOL APTITUDE BATTERIES 


One other group of aptitude tests, so-called, are the tests that 
have been developed to select individuals for particular types of pro- 
fessional training. Many types of professional schools, sometimes 
individually but more often operating through their professional or- 
ganizations, have instituted testing programs for the selection of 
their students. Testing programs are in operation for selecting stu- 
dents for engineering, law, medicine, dentistry, veterinary medicine, 
nursing, and accounting, to mention a few. 

The tests used in these professional-school batteries tend to be 
tests of reading, quantitative reasoning, and apprehending abstract 
relationships, with the balance and emphasis shifted somewhat to 
conform to the academic emphasis of the particular training pro- 
gram. They are largely minor variations upon the same theme—a 
relatively high-level measure of scholastic aptitude. The different 
professional aptitude tests would correlate very substantially with 
one another or with a measure of general intelligence, and, indeed, 
it should be expected that they would because the abilities required 
to succeed in training for the different professions have much in
common. The similarities outweigh the differences. The common
core is adapted to the professional field, as by giving more emphasis
to quantitative materials for engineering and more to verbal materials
for law. It is supplemented in some cases by rather highly specialized
tests, for example, a test of chalk-carving for dentistry. These
variations are superimposed upon the basic theme of scholastic
aptitude.




MEASUREMENT OF MUSICAL APTITUDE 


When we come to such fields as music and art, the need for special
measures of aptitude becomes quite apparent. Grades in these subjects
are usually among those least well predicted by general measures
of scholastic aptitude. Furthermore, the specialized nature of outstanding
talent in these fields has long been recognized. Our problem
is to determine what the components of this talent are and devise
ways of appraising them.

In musical ability one large component is executive or motor, the
ability to master the patterns of action required for playing an
instrument. Aptitude measures have largely avoided this domain,
perhaps because of its specificity to a particular instrument. Most
measurement has been directed toward the perceptive and interpretive
aspects of music.

Hearing music involves in the first place various types of sensory 
discrimination—discrimination of pitch, of loudness, of temporal re- 
lations. It involves in the second place perceiving the more complex 
musical relations in the material, interval relationships, the pattern 
of a melody, the composition of a chord, the relationship of a har- 
mony to a melody. Third, it involves esthetic judgments about the 
suitability and pleasingness of a melody or harmony, a rhythmic 
pattern, or a pattern of dynamics. 

The most thoroughly investigated musical aptitude test battery, 
the Seashore Measures of Musical Talents, is directed primarily toward 
measuring simple sensory discriminations, though with some attention
to perceiving slightly more musical material. The tests have
analyzed music down so far that very little music remains. Thus,
there are the following subtests:


1. Discrimination of pitch: judging which of two tones is higher.

2. Discrimination of loudness: judging which of two sounds is louder.

3. Discrimination of time interval: judging which of two intervals is longer.

4. Judgment of rhythm: judging whether two rhythms are the same or
different.

5. Judgment of timbre: judging which of two tone qualities is more pleasing.

6. Tonal memory: judging whether two melodies are the same or different.


The items are on phonograph records, with a series of items of each
type. Within each type, the judgments become progressively more
difficult.

The analytic approach to musical aptitude is evident in the above
list of subtests. Critics have contended that the analysis has removed
the tests a great way from any genuinely musical material and that


fine discriminations of pitch, time, and intensity are really not called 
for in the activities of the musician. Validity studies of the Seashore 
tests have been somewhat conflicting, yielding appreciable correla- 
tions with measures of musical success in some instances and very 
low correlations in others. The value of the analytic test is still a 
matter of doubt and controversy. 

Contrasting rather markedly with the Seashore type of test are 
the Wing Standardised Tests of Musical Intelligence. These tests, 
developed in England, were designed to stay as close as possible to 
the actual materials of music. The following subtests are included: 


1. Chord analysis: detecting the number of notes in a single chord.

2. Pitch change: detecting the direction of change of one note in a repeated
chord.

3. Memory: detecting which note is changed when a short melodic phrase
is repeated.

4. Rhythmic accent: judging which of two performances of the same piece
has the better rhythmic pattern.

5. Harmony: judging which of two harmonies is more appropriate for a
melody.

6. Intensity: judging which of two playings of the same piece has the more
appropriate pattern of dynamics.

7. Phrasing: judging which of two renditions has the more appropriate
phrasing.

This test is made up in part of tests that call for perceiving musi- 
cal relationships and in part of tests that call for esthetic choices in 
intact musical material. Information on the validity of the test is 
still limited, but what there is seems very promising. Thus, the 
author reports⁹ correlations of 0.64, 0.78, and 0.82 with teachers’
rankings in three small samples. If these are maintained in future 
studies, a test like the Wing Test would appear to have a very real 
place in guidance of young people who have musical aspirations or 
whose families hold such aspirations for them. The tests are, unfor- 
tunately, not very well recorded and not very readily available in 
the United States. 
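
The caution about small samples can be made concrete. A common approximation, Fisher's z transformation, gives a rough interval within which the population correlation plausibly lies. The sketch below applies it to the three coefficients quoted for the Wing tests. The sample size of 25 is purely an assumption for illustration, since the sizes of the three samples are not given here, and the function name is hypothetical.

```python
from math import atanh, sqrt, tanh


def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95 per cent interval for a correlation (Fisher z)."""
    if n <= 3:
        raise ValueError("the approximation needs more than 3 cases")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r)                      # transform r to Fisher's z
    half_width = z_crit / sqrt(n - 3)  # standard error of z is 1/sqrt(n-3)
    return tanh(z - half_width), tanh(z + half_width)


ASSUMED_N = 25  # hypothetical sample size, not reported in the text

for r in (0.64, 0.78, 0.82):
    low, high = correlation_interval(r, ASSUMED_N)
    print(f"r = {r:.2f}, n = {ASSUMED_N}: roughly {low:.2f} to {high:.2f}")
```

With samples of this assumed size the intervals are wide, which is why confirmation in further studies matters before the coefficients are taken at face value.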


TESTS OF ARTISTIC APTITUDE 


Several types of tests are available relating to aptitude for art.
In the first place, there have been tests of esthetic judgment. That
field is now fairly well dominated by the Meier Art Judgment Test.
Each item consists of a pair of pictures of art objects. One is an
acknowledged masterpiece. The other is that same masterpiece
systematically distorted in some specified way. The examinee must
choose the better picture in each pair, the test blank indicating the
respect in which the two specimens differ.

A new test of the judgmental aspect of art ability is the Graves 
Design Judgment Test. This differs from the Meier Test in that all 
the items consist of abstract and non-representational material. The 
members of a pair differ in some single aspect of design, i.e., balance, 
symmetry, variety. Judgment of design is presumably divorced from 
any particular object or content.

In an attempt to get at the productive, as distinct from the purely
judgmental, aspect of art, several tests (Horn, Knauber, Lewerenz)
require the subject to produce drawings, based on certain limiting
“givens.” Thus, in the Horn Art Aptitude Inventory, a pattern of
lines and dots is provided, and from this material the examinee must
produce a sketch. The type of item is indicated in Fig. 10.4.


Fig. 10.4. Example of type of item used in Horn Art Aptitude Inventory. 


The products must be evaluated by subjective rating, according to
standards given by the authors, but they present some evidence that
this can be done rather reliably even by non-artists.


The Lewerenz Tests in Fundamental Abilities of Visual Art use dot 
patterns to elicit drawings, whereas the Knauber Art Ability Tests 
use various assigned drawing tasks. Both these last two tests also 
present problems in shading, perspective, and composition. 

Art tests have been rather generally successful in differentiating 
art students or art teachers from other groups. However, it has been 
argued that they accomplish this because they are in large measure
achievement tests rather than aptitude measures. There has been 
relatively little study of these tests as aptitude measures with un- 
trained individuals. Studies of art students have indicated that test 
performance is reasonably predictive of later art-school success. 
Thus, Horn and Smith⁶ find a correlation of 0.66 between score on
the Horn test at the beginning of the year and average faculty rating
of success in a special high-school art class at the end of the year.
Barrett¹ correlated four art tests with grades in a ninth-grade art
course and with ratings of pupils’ art products, with the following 
results: 


                                     Course    Ratings of
                                     Grade     Product
McAdory Art Test                      0.10       0.13
Meier Art Judgment Test               0.37       0.35
Knauber Art Ability Test              0.33       0.71
Lewerenz Fundamental Art
  Abilities Test                      0.40       0.76


Thus the last two tests, requiring production of drawings by the 
examinee, had about the same correlation with grades as did the 
Meier Art Judgment Test but much higher correlations with appraisals
of student products. 


We can see from the above that the test tasks that require art 
students to do the sorts of tasks they will be taught to do in art class 
predict their later achievement. How far down to untrained pupils 
this can be pushed remains to be determined. 

Since the keying of art tests of all types depends upon a pooling 
of judgments, obtaining a high score requires conformity to the ac- 
cepted esthetic standards. There is real question as to the applica- 
bility of these tests (or the tests of musical aptitude) in a distinctly 
different culture. There is also the possibility, though it is a fairly 
unlikely one, that a highly talented but unconventional person will 
be penalized on the tests.




SUMMARY STATEMENT 


Though general intelligence tests bear some relationship to success 
in many fields, efficient vocational guidance or personnel classifica- 
tion calls for tests more specifically directed at the abilities called 
for by each kind of job. Analytical studies of human abilities support
the genuineness and importance of these special abilities. Numerous
tests of special abilities have appeared, and more recently tests
of this sort have been organized into comprehensive aptitude batteries
for use in vocational guidance or personnel classification.

Special tests to evaluate readiness to undertake particular educational
tasks have also been developed. The most widely used of
these are reading readiness tests. Other types of prognostic tests
have not been widely used, perhaps because their function is reasonably
well served by measures of scholastic aptitude and academic
achievement. Professional aptitude batteries appear to be variations
upon the basic theme of scholastic aptitude tests.

The fields of music and art have produced a number of ability
tests. However, highly analytic tests have not been very clearly
successful. More complex tests involve an unknown admixture of
previous training. These show reasonably good validity and may
provide an improved and at least relatively objective way of appraising
status and, hence, promise in the field.


REFERENCES 


1. Barrett, H. O., An examination of certain standardized art tests to
determine their relation to classroom achievement and to intelligence,
J. educ. Res., 1949, 42, 398-400.

2. Bennett, G. K., and Ruth M. Cruickshank, A summary of clerical tests,
New York, Psychological Corp., 1948.

3. Bennett, G. K., and Ruth M. Cruickshank, A summary of manual and
mechanical ability tests, New York, Psychological Corp., 1942.

4. Gates, A. I., G. L. Bond, and D. H. Russell, Methods of determining
reading readiness, New York, Teachers College, Columbia University, Bureau
of Publications, 1939.

5. Ghiselli, E. E., The validity of commonly employed occupational tests,
Univ. Calif. Publ. Psychol., 1949, 5, 253-388.

6. Horn, C. A., and L. F. Smith, The Horn Art Aptitude Inventory, J. appl.
Psychol., 1945, 29, 350-355.

7. Lee, J. M., and W. W. Clark, Lee-Clark Reading Readiness Test: Manual,
Los Angeles, Calif., California Test Bureau, 1951.

8. Thorndike, R. L., Tests and long-time prediction of vocational choice,
in Traxler, A. E., Editor, Strengthening education at all levels, Report of
Eighteenth Educational Conference sponsored by the Educational Records Bureau
and the American Council on Education, Washington, D. C., American Council
on Education, 1954.

9. Wing, H., Tests of musical ability and appreciation: An investigation 
into the measurement, distribution, and development of musical capacity, 
Brit. J. Psychol. Monogr. Suppl., 1948, 8, No. 27. 


SUGGESTED ADDITIONAL READING 


Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chap- 
ters 14-16, and 19. 

Kandel, I. L., Professional aptitude tests in medicine, law, and engineering,
New York, Teachers College, Columbia University, Bureau of Publications, 
1940. 

Monroe, Walter S., Editor, Encyclopedia of educational research, rev. ed., 
New York, Macmillan, 1950, pp. 874-894. 

Super, Donald E., Appraising vocational fitness by means of psychological 
tests, New York, Harper, 1949, Chapters 4, 6, 8-11, and 15. 


QUESTIONS FOR DISCUSSION 


1. A number of aptitude test batteries have been developed for use at the 
secondary-school level, but almost none for the elementary school. Why is 
this? Is it a reasonable state of affairs? 

2. What are the advantages in using a battery such as the Differential Aptitude
Tests instead of tests selected from a number of different sources? What
are the limitations? 

3. Step by step, what would need to be done to set up a program for se- 
lecting students for a dental school? 

4. How could a high-school counselor use the data of Table 10.1? What are 
the limitations on the usefulness of this material? 

5. How might the counselor use the data of Table 10.2? What are its
limitations? 

6. Comment on the statement: “The best measure of aptitude in any field
is a measure of achievement in that field to date.”

7. What are the differences between a reading readiness test and an in- 
telligence test? What are the advantages of using the readiness test rather 
than an intelligence test for first-grade pupils? 

8. To what extent are tests like the Horn Art Test and the Wing Music Test
measures of aptitude? To what extent are they measures of achievement? 

9. What factors tend to make tests of artistic and musical aptitude some- 
what less useful than other types of aptitude tests? 

10. In what ways could a follow-up study of graduates of a high school 
help in improving the school guidance program? 


Chapter 11


Achievement Tests 


STANDARDIZED VERSUS TEACHER-MADE TESTS 


We turn now to tests of school achievement. We shall be concerned
particularly with commercially available standardized tests,
though we shall need to consider other types of appraisal devices in
order to see how the standardized test fits into the total picture.
As we indicated in the previous chapter, the distinction between an
aptitude test and an achievement test is a somewhat blurred one.
However, we shall be interested now in measures of knowledges and
skills that are closely tied to organized school instruction, and in
measures that are being used primarily to appraise present status in
those school-taught knowledges and skills.

Standardized tests do not represent anything new and strange in
the measurement of academic achievement. They are blood brothers
of the short-answer teacher-made tests that were discussed in Chapter 4.
They are made up of the same types of items and cover
many of the same areas of knowledge. In what ways, then, do they 
differ from teacher-made tests? What are the advantages and limi- 
tations of each? For what purposes should each be used? 


DISTINCTIVE FEATURES OF STANDARDIZED TESTS 

There are four main ways in which commercially distributed
standardized tests differ from the tests that the individual teacher
would prepare for his own class.


1. The standardized test is based on the general content and objectives
common to many schools the country over, whereas the
teacher's own test can be adapted to content and objectives specific
to his own class.

2. The standardized test deals with large segments of knowledge
or skill, whereas the teacher-made test can be prepared in relation
to any specific limited topic.



3. The standardized test is developed with the help of professional 
writers, reviewers, and editors of test items, whereas the teacher- 
made test must usually rely upon the skill of one or two teachers. 

4. The standardized test provides norms for various groups that 
are broadly representative of performance throughout the country, 
whereas the teacher-made test lacks this external point of reference. 


The distinctive features of the standardized test represent impor- 
tant advantages for some purposes and disadvantages for others. 
Basing the test upon a careful analysis of the common objectives
expressed in textbooks, courses of study, and reports of committees 
of professional societies should guarantee that the thinking of many 
specialists has entered into the test plan. However, a published 
test is fixed for a period of years in terms of broad, common objec- 
tives. It is not a flexible tool. It cannot be adapted to special 
current needs, to local emphases, or to particular limited units of 
study. 

The value of standardized tests lies particularly in situations in 
which comparisons must be made—comparisons of a school with 
other schools, comparison of achievement in different areas by a 
pupil or by a school group, or comparison of achievement with the 
potentiality for achievement indicated by an aptitude test. The 
norms provided with standardized tests make such comparisons
readily possible. School achievement may be compared with national
norms. Age or grade level in different subjects may be compared
for pupil or for group. Age or grade level on an achievement
test may be compared with age or grade level on a measure of
scholastic aptitude.
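
The last of these comparisons, achievement against potentiality, can be sketched as a simple difference between two grade-level scores. The example below is only an illustration. The record layout, the pupils and their grade equivalents, and the one-grade threshold for flagging a pupil are all hypothetical choices, and grade equivalents taken from different tests are comparable only to the extent that the two sets of norms rest on similar groups.

```python
from dataclasses import dataclass


@dataclass
class PupilRecord:
    name: str
    aptitude_grade: float     # grade level on a scholastic aptitude measure
    achievement_grade: float  # grade level on an achievement test


def discrepancy(pupil):
    """Achievement grade level minus aptitude grade level."""
    return round(pupil.achievement_grade - pupil.aptitude_grade, 1)


def below_expectation(pupils, threshold=1.0):
    """Names of pupils achieving at least `threshold` grades below aptitude.

    The default threshold of one grade is an arbitrary illustrative choice.
    """
    return [p.name for p in pupils if discrepancy(p) <= -threshold]


# Hypothetical records, invented for illustration.
pupils = [
    PupilRecord("Pupil A", aptitude_grade=5.2, achievement_grade=4.0),
    PupilRecord("Pupil B", aptitude_grade=4.1, achievement_grade=4.3),
    PupilRecord("Pupil C", aptitude_grade=6.0, achievement_grade=5.0),
]

for p in pupils:
    print(f"{p.name}: discrepancy {discrepancy(p):+.1f} grades")
print("Below expectation:", below_expectation(pupils))
```

A teacher-made test could not support this comparison, because it has no norms that place its scores on a grade-level scale.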


FUNCTIONS OF STANDARDIZED AND TEACHER-MADE TESTS 


In the light of these differences, we propose that chief reliance
should be placed on teacher-made tests when we want to test in order
to


1. See how well students have mastered a unit of instruction.

2. Determine the extent to which distinctive local objectives have
been achieved.

3. Provide a basis for assigning course marks.


Standardized tests should be used when we wish to test in order to


1. Compare achievement with potentiality for an individual or a
group.

2. Compare achievement of different skills or in different subject
areas.

3. Make comparisons between different classes and schools. 

4. Study pupil growth over a period of time to see whether prog- 
ress is more or less rapid than might be expected. 


For some purposes, such as pupil diagnosis, we may wish to use not 
only standardized and teacher-made tests but a variety of informal 
testing and observational procedures as well.

We see, therefore, that standardized and teacher-made tests both
have important functions to perform in the educational economy. 
To a large extent they are different functions. The two types of 
evaluation supplement one another. They are not competitors. 

Standardized tests of achievement have been developed for prac- 
tically every subject in the school curriculum. It would be impos- 
sible to give even a brief treatment of all subject areas in the pages 
that can be allotted to achievement tests in this book. We have 
decided, therefore, to concentrate on tests in a single area, in the 
belief that a fairly full treatment of this area will serve to introduce 
almost all the major problems and issues that would be encountered 
in dealing with tests in any area. We have chosen to discuss reading 
tests for two reasons. In the first place, these tests are more widely 
used than those in any other subject area. In the second place, a 
great variety of both survey and diagnostic procedures have been
developed in this area, so that we shall get an introduction to a wide
spectrum of testing techniques. Readers especially interested in tests
in other areas can get names and evaluations of available tests in
their field from Buros’ Mental Measurements Yearbooks (see Chapter 8).


ANALYZING THE OBJECTIVES OF AN AREA: 
THE FIRST STEP 


Before we can proceed intelligently with either the preparation
of a test in an area or the choice of one from among those already
existing, we must analyze and define clearly the objectives of our
instruction in the field in question. Is test A a good test? Good
for what? What are we trying to produce in children? What do
we want or need to evaluate? Only when we have in our minds a
clear answer to these further questions can we answer the question:
Is this a good test? An analysis of objectives helps us to identify
the strengths and weaknesses of our evaluation program.



When we look at our statement of objectives, we will see some for 
which an existing test seems appropriate, some for which testing 
procedures might be devised if sufficient ingenuity were available, 
and some that seem obviously inaccessible by test procedures. Con- 
sider the following statement of objectives in the field of reading. * 


Essential Knowledge, Attitudes, Skills, and Procedures in Reading


I. Basic attitudes, skills, and procedures involved in securing meaning in
both reading and listening.
A. To respond to the motive, problem, or purpose.
B. To direct attention to the meaning of what is read.
C. To develop fluent, accurate perception of word forms.
1. Accurate discrimination of word forms.
2. Accurate perception of both form and meaning.
3. Association of right meanings with word forms.
4. Accurate perception of words in context.
5. Fluent perception of words.
D. To secure an adequate understanding of what is read.
1. A clear grasp of meaning, involving:
a. Selection of meanings of words appropriate to the context.
b. Fusion of the meanings of words into a chain of related ideas.
c. Recognition of the importance and relationship of the ideas
acquired.
2. Coping successfully with such factors as:
a. Unusual word order.
b. Complexity of sentence structure.
c. Abstract ideas.
3. Interpreting meaning in the light of its broader context. This
ability implies an understanding of:
a. The total setting of the ideas expressed.
b. The author's mood, tone, and intention.
4. Supplementing the specific meanings apprehended.
a. Reading between the lines.
b. Seeing implications.
E. To react critically to what is read.
1. Recognizing the value, usefulness, timeliness, and significance of
what is read.
2. Judging the validity or truthfulness of the ideas presented in the
passage.
3. Judging the accuracy or completeness of the author's conclusions.
4. Recognizing whether or not the reasoning of the author is sound.
5. Identifying and resolving propaganda.
F. To integrate the ideas acquired with previous experience so that the
following evidences of understanding may be noted immediately, or
later:
1. New insights are acquired.
2. Previous understandings are reaffirmed or modified.
3. Challenging problems are solved.
4. Rational attitudes are acquired.
5. Behavior is modified.
6. Interests are broadened.
7. Richer and more stable personalities are developed.
II. Supplementary attitudes, skills, and procedures essential in many silent-
reading activities.
A. To locate needed information.
1. Using an index.
2. Using a table of contents.
3. Using the dictionary.
4. Using card files.
5. Using reference books.
B. To gather and evaluate information in the light of a given purpose.
1. Recognizing the purpose to be achieved.
2. Applying appropriate fact-finding techniques.
3. Sorting essential from non-essential information.
4. Judging the validity and significance of relevant information.
5. Organizing the information in terms of the purpose or problem.
6. Drawing tentative conclusions.
7. Deciding when the purpose has been achieved.
C. To adjust reading attitudes and procedures to different purposes.
1. Modifying interpretative processes in light of the purposes to be
achieved. As, for example,
a. Reading to answer factual questions.
b. Reading an organized body of material to report.
c. Reading to determine the accuracy of the facts or events
described.
2. Adjusting rate of reading to the purpose.


* Adapted from Greene, Harry A., and William S. Gray, "The measurement of
understanding in the language arts," The Forty-Fifth Yearbook of the National
Society for the Study of Education, Part 1, 1946.



If we accept this as a working outline of outcomes that might be
evaluated in the field of reading, then we may ask: For which of
these objectives does test A provide an adequate appraisal? For
which of these objectives can we expect any standardized test
procedure to provide an adequate appraisal? For which must some
non-test procedure be developed as an appraisal device?

With this statement of objectives before us, let us take a look at
a widely used elementary-school reading test. Suppose we take the
Metropolitan Reading Test, Intermediate Level, Form R. The types
of items composing the test are illustrated in Fig. 11.1. As we
examine the test items and refer to our statement of objectives, it
appears that the tasks appraise primarily those objectives specified
under I C and I D. The test of word knowledge assesses most directly
I C 3, "association of right meanings with word forms," involving as
a prerequisite I C 1, "accurate discrimination of word forms." The
test of paragraph reading, with its associated questions, relates
most directly to I D 1, "a clear grasp of meaning" and I D 2,
"coping successfully with such factors as: (a) unusual word order,
(b) complexity of sentence structure, and (c) abstract ideas."
This builds, of course, upon perception of words and knowledge of
word meanings. The missing-word items seem harder to relate
specifically to certain stated objectives but appear to fit more or
less into the same pattern. Clearly, this test gives one little
chance to test the reader's ability or tendency "to react critically
to what is read" (I E), "to integrate the ideas acquired with
previous experience" (I F), or to exhibit the various types of
supplementary attitudes, skills, and procedures indicated under II.

2-3. Billie has a brown dog. His (2) has a white spot on his
back, and so Billie (3) him Spotty.

The fur of the mink is popular both because of its quality and its
rarity. Raising minks for their furs is called mink ranching. The
pelts from ranch minks are now considered better than those from
wild minks. Because minks fight with each other, they are kept in
separate small pens. These have wire floors and tops and are raised
off the ground. The minks are fed mostly raw meat or fish. The young
minks, called kits, are born in spring, sometimes as many as ten in
a litter. In the early winter, they are killed, and their skins or
pelts are prepared for use in fur coats and other articles.

41. Why are several minks not kept in the same pen?
42. What is another name for the skin of a fur-bearing animal?
43. What two foods are the minks most commonly fed?
44. Of what is the floor of the mink's pen made?
45. When are the minks killed for their furs?
46. What is another name for the baby minks?
47. Write in the answer space the letter which appears in front of
the best title for this paragraph.
a. Wild Minks b. Making Mink Coats
c. Raising Minks for Fur d. Feeding the Minks

1. A friend is one we 1 strike 2 throw 3 love 4 fear
2. A noise is a 1 smell 2 sound 3 joke 4 song

Fig. 11.1. Sample items from the Metropolitan Reading Test,
Intermediate Level, Form R.*

* Reprinted with the permission of the World Book Co.

Tests might be developed to appraise others of these objectives.
For example, the Iowa Silent Reading Tests have subtests dealing
with certain of the information-getting skills listed under II A.
These skills are quite susceptible to appraisal by objective test
procedures. The Progressive Educational Association has developed a
number of tests, until recently distributed by the Cooperative Test
Division of the Educational Testing Service,* that are directed at
certain of the components of ability "to react critically to what is
read" (I E). The tests were not originally designed specifically as
reading tests but as tests of significant communication skills. These
involve selecting and appraising possible conclusions, identifying
propaganda tricks, analyzing the facts of a story, the author's point
of view, and the character's motivation, and identifying relevant and
irrelevant statements in support of a given conclusion. Within limits,
these skills are testable.

Still other objectives would be very hard to appraise in any
limited testing situation. It would be difficult to determine how well
the individual had been able "to integrate the ideas acquired with
previous experience" (I F) or "to gather and evaluate information
in the light of a given purpose" (II B). Even more difficult might
be the appraisal of the extent to which the individual will read and
the type of reading he will select, as distinct from the extent to
which he can read.

In this discussion we have been trying to make four main points.
These points refer not only to reading but to any segment of the
school program. The points are these:

1. The teaching of reading, or of any segment of the school
program, is a complex undertaking looking to a variety of different
outcomes.
2. A specific existing test will provide an appraisal of only certain
ones of the desired outcomes.
3. Some of the desired outcomes are not likely to be reached by
any test procedures.
4. The evaluation of any test of achievement requires a formulation
of your objectives and an analysis of the test to see to what
extent its content conforms to those outcomes that you are seeking
to achieve.
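
The analysis called for in the fourth point can be kept as a simple tally. The sketch below is only an illustration: the item codings and the list of desired objectives are hypothetical, and the function name `objective_coverage` is our own. It assumes that someone has already examined each test item and judged which objective in the outline it bears upon.

```python
from collections import Counter


def objective_coverage(item_codes, desired_objectives):
    """Count how many test items were judged to bear on each desired objective."""
    counts = Counter(item_codes)
    # Counter returns 0 for an objective that no item was coded against.
    return {objective: counts[objective] for objective in desired_objectives}


# Hypothetical coding of a ten-item test against the reading outline.
item_codes = ["I C 3", "I C 3", "I C 3", "I D 1", "I D 1",
              "I D 1", "I D 1", "I D 2", "I D 2", "I D 2"]
# Hypothetical list of the outcomes a particular school is seeking.
desired = ["I C 3", "I D 1", "I D 2", "I E", "II A"]

coverage = objective_coverage(item_codes, desired)
untested = [objective for objective, n in coverage.items() if n == 0]
print(coverage)
print("No items for:", untested)
```

Objectives that turn up with no items are the ones for which another test, or some non-test procedure, would have to be sought.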


SURVEY READING TESTS 


One of the most widely used types of standardized test is the
survey reading test. Because of the importance of reading skills in
all aspects of the school program, many schools make a special effort
to appraise these skills as a basis for planning group activities and
individual remedial action. As was suggested on pp. 272-273, reading
is a complex and far-reaching enterprise. The commercial survey
tests undertake to appraise only certain aspects of this range of
skills. In Appendix III a number of the better-known and more
widely distributed reading tests are listed, showing the grade levels
for which forms are available, the types of subtests that are included
in each, and certain other items of practical interest about each.

* The Educational Testing Service does not plan to continue printing
these tests but will give permission to reproduce them. Direct
inquiries to the Educational Testing Service, 20 Nassau St.,
Princeton, N. J.

The subtests most frequently included in survey reading tests are 
word knowledge and paragraph reading. The test of paragraph 
reading usually involves paragraphs of some length with questions 
based upon each, though the pattern of a missing word or words to 
be supplied is sometimes used (Metropolitan and Stanford) and the 
technique of requiring the reader to identify the word which spoils 
the meaning has also been tried (Science Research Associates). The 
paragraph with questions seems to correspond most naturally to 
the normal reading task. 

The paragraph-with-questions pattern still leaves room for a wide
range of variation in the processes that are actually tapped by the
test. Only a critical examination of the single test items will enable
the potential user to tell how many items call merely for knowing a
word meaning, how many call for selecting the particular meaning
which fits the context, how many for the answer to a specific factual
question that is answered in the passage, how many for an inference
based upon information given in the passage, how many for getting
the main idea or theme of the passage, how many for sensing the
author's mood or purpose, and how many for recognizing literary
devices used in the passage. Reading with understanding is all
these things and more. The different components are represented
in different tests in very different proportions.

Fortunately, these different skills are all positively correlated, so
that a survey based on some combination of the skills will tend to
rank pupils in fairly much the same way as a survey based on a rather
different balance of them. The child who does well on a test made up
of directly factual items will in most cases tend to do well on one
involving items of inference and synthesis. But the correlation is
far from perfect. The potential user must examine each test in which
he is interested, bearing in mind the specific types of comprehension
skills that he deems important, in order to judge whether that
particular test is the one best suited for his purposes. In the same
way, a test in any subject matter must be scrutinized to see whether
the test items represent the balance of information, understanding,
and application that corresponds well with the objectives of teaching
in the using school.


MEASURING SPEED OF READING 

An additional factor that enters in to complicate the appraisal of
reading achievement is the factor of speed. Speed of performance is
a complicating factor in measuring achievement in any area, but
perhaps especially so in the case of reading. To what extent do we
want speed per se to enter into our score? Do we wish to penalize the
person who is a slow worker but can accomplish a good deal if we
give him time? Or do we want to get a pure measure of power,
uncontaminated by speed of work?

Different tests resolve this problem in somewhat different ways,
but generally speaking good testing practice accepts as its goal the
separate measurement of speed of performance and level of
performance. In measuring level, the objective is to provide enough
time on any test so that each individual has had an opportunity to
progress as far as his ability permits. This means either that he has
had time to try all the items on the test or, if the items are graded
in difficulty, that he has had time to work along to the point where
he can no longer succeed with any of the test tasks.

In practical testing, this goal can only be approximated. Tests
must be given with a definite time limit if they are to be fitted into
a school program. Furthermore, it does not make for good testing
conditions to have time limits so long that half the group is sitting
around fidgeting. Time limits for a test designed to be a power test
are usually figured so that most of the pupils have time to try most
of the items.

But in the case of reading, we are also interested in speed for its
own sake. Slow, laborious reading is inefficient and time-consuming.
With some materials, it is desirable to be able to skim rapidly in
order to cover a large amount of material in a short time. For this
reason, a number of tests have undertaken to include separate
measures of speed of reading.

The measurement of speed also presents its problems. We start
a group of pupils reading an extended passage. At the end of 2
minutes we stop them and tell them to mark the word they were
reading when the signal was given. But how do we know that they
actually read the intervening material? Some may have read it
word for word, others only skimmed, and others just read parts.
What does "reading the passage" really mean?




Various devices have been tried in the attempt to make the task
more uniform for different examinees. In some tests, such as the
Iowa Silent Reading Tests and the Traxler High School Reading Test,
the reader is instructed to read in such a way that he will be able to
answer questions. The Gates Reading Survey for Grades 3 to 10 uses
very brief paragraphs, each ending with a multiple-choice question,
and the score is the number of questions answered within the time
limit. The Michigan Speed of Reading Test includes in each
two-sentence unit one word that spoils the meaning of the unit. The
reader must cross out these words. But these devices are only partly
satisfactory. Any device that doctors up the actual text tends to
distort the normal process of reading, and yet we cannot rely upon
instructions alone to bring about comparable care and thoroughness by
different readers. The reading speed test is at best very dependent
upon the instructions given to the examinees and provides only an
approximate indication of the relative speeds at which different
people can read when degree of care and comprehension are uniform for
each person.


SUMMARY STATEMENT 


The survey test is, then, a sampling of the tasks that comprise a 
particular skill or knowledge. Only certain aspects of the total skill 
are represented, and the balance is a little different in each test. 
Most survey achievement tests have reasonably satisfactory reliabil- 
ity. The main problem is to pick the one that, in its content, pro- 
vides the best balance of skills for your particular objectives. 


DIAGNOSTIC TESTING 


A survey achievement test undertakes to provide a general,
over-all appraisal of status in some area of knowledge or skill. A
diagnostic test undertakes to provide a detailed picture of strengths
and weaknesses in an area. Furthermore, it is anticipated that this
detailed analysis will suggest causes for general deficiencies and
provide a guide for remedial procedures. A survey reading test tells
us that Johnny, who is starting the fourth grade, performs on our test
of reading paragraphs at a level typical of the usual child beginning
the second grade. A series of diagnostic tests indicates that Johnny
has a fair sight vocabulary of common words but no skills for working
out unfamiliar words, that he is unable to blend sounds to form words,
that he does not recognize the sounds that correspond to letter
combinations, and that he makes frequent reversal errors. These
findings, together with others, provide the basis for planning
remedial teaching of word analysis and phonic skills that are
specifically directed toward Johnny's deficiencies. Development of
diagnostic tests involves two steps: (1) analysis of the complex
performance (be it reading, multiplying fractions, or using a
microscope) into its component subskills and (2) developing tests for
the component skills, free as far as possible from any other source of
difficulty.

It has become fashionable in recent years to call many tests
"diagnostic tests." In a sense, any test that yields more than a
single over-all score is diagnostic. Even if there are only two part
scores, say, one for word knowledge and one for paragraph
comprehension, the test makes it possible for us to say that Johnny
showed better ability in word knowledge than he did in reading
connected prose. This is certainly one diagnostic clue. Diagnosis is,
after all, a matter of degree. We may probe and analyze with varying
degrees of thoroughness and detail. We must ask of any test purporting
to be diagnostic: How complete and how adequate are the diagnostic
cues that this test provides? It is easy to overstate the value of the
diagnostic information provided by a particular test.

Diagnostic testing faces a very troublesome dilemma. How is the
test to provide sufficient diagnostic detail and yet appraise each
separate ability with sufficient reliability? The essence of good
diagnosis is that one should get many distinct and relevant facts
about the individual. One wishes an appraisal of each of the component
abilities into which the complex performance has been analyzed. At
the same time, it is important that the separate appraisals have
adequate reliability.

Reliable appraisal is particularly important in diagnostic testing
for two reasons. In the first place, in diagnostic work we are in
almost every instance interested in the individual. It is his personal
strengths and weaknesses with which we are concerned. Group
averages or group comparisons are of no particular interest to us in
this context. We cannot fall back upon averages to balance out the
chance errors in measuring a particular pupil. We need an accurate
appraisal of the specific individual. This is made more acute by the
fact that we are dealing with differences between the individual's
performance in related tasks. We are interested in making such a
statement as: "This pupil's ability to pick out the main idea in what
he has read is poorer than his ability to answer questions on specific
factual details." But the two abilities are quite closely related.
How reliably can we measure the differences between the two?

At this point, the student could well refer to the discussion of
profiles on pp. 172-182. A set of diagnostic scores is a specific
instance of a profile. All the issues about the reliability of a
difference score that were raised in that discussion apply very
acutely to the case of diagnostic tests. Since we are dealing with
different aspects of a single field, correlations between tests are
likely to be fairly substantial and the loss of reliability to be
considerable when we have to think about differences. One would think,
this being so, that authors of diagnostic tests would have been
particularly concerned about the reliability of their instruments.
But, alas, this has not generally been the case. The temperament that
becomes excited about problems of diagnosis appears to be different
from the temperament that grows concerned about issues of reliability
of measurement. It must be confessed that in many cases the
reliability of diagnostic tests is quite modest and that in many
others it is unreported.

All this means that diagnostic test results must be interpreted 
with caution. The tests provide some rough and quite tentative 
hypotheses as to the individual's strengths and weaknesses. But 
they must be clearly recognized as tentative hypotheses and nothing 
more. The test profile suggests possible causes for the present diffi- 
culty and a jumping-off place for remedial work. If the remedial 
activities are successful, well and good. If not, the remedial teacher 
must stand ready to review his hypotheses and to explore other 
leads. Diagnostic test results are suggestions, not commands. 

We find several types of diagnostic instruments in reading, and
these serve also to illustrate the varieties of diagnosis in other
areas. In the first place, we find tests with somewhat specialized
subtests yielding scores for some aspect of the total function. This
type is well illustrated by the Iowa Silent Reading Tests (Advanced
Level). These have the following subtests, each supposed to represent
a somewhat different aspect of reading skill.

Test 1. Rate and Comprehension of connected prose.
Test 2. Directed Reading of connected prose to locate answers to
factual questions.
Test 3. Poetry Comprehension, including mood, metaphor, etc.
Test 4. Word Meaning in different content areas.
Test 5. Sentence Meaning of brief sentences out of context.
Test 6. Paragraph Comprehension: selecting central idea and
comprehending essential details.
Test 7. Location of Information: using an index, selecting key words.


How many of these are in fact both sufficiently reliable and
sufficiently different to be usefully diagnostic is a real question.
For example, the reliabilities of Test 5, Sentence Meaning, and
Test 6, Paragraph Comprehension, are reported (probably somewhat
optimistically, since the coefficients are based on odd versus even
halves and the tests have quite short time limits) as 0.751 and 0.759.
The correlation between the two tests is reported as 0.48. From
these values, we may estimate the reliability of the difference score
to be 0.53. Inferences from a datum having this level of reliability
should be made very cautiously.
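
The estimate of 0.53 can be checked with a few lines of arithmetic. The sketch below assumes the usual formula for the reliability of a difference between two scores expressed on scales of equal variability: the average of the two reliabilities, less the correlation between the tests, divided by one minus that correlation. We take this to be the formula intended in the earlier discussion of profiles, and the function name is our own.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference between two equally variable scores.

    r_xx, r_yy: reliabilities of the two tests.
    r_xy: correlation between the two tests.
    Assumes both scores are on scales with the same standard deviation.
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Values quoted in the text for Test 5 and Test 6 of the Iowa Silent Reading Tests.
r_diff = difference_reliability(0.751, 0.759, 0.48)
print(round(r_diff, 2))  # 0.53
```

The same function shows how quickly the figure falls as the two tests become more alike: holding the reliabilities fixed and raising the correlation from 0.48 to 0.60 brings the reliability of the difference down to about 0.39.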

The use of subtest scores such as those on the Iowa is probably
most justifiable for a class or larger group. With a group average,
chance errors tend to cancel out, and the low reliability of the scores
becomes less important. If the group as a whole shows some marked
weakness, as in the use of indices and library aids, for example, this
may point out areas in which instruction has been neglected and
suggest directions for instruction for the group as a whole.
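
The way chance errors cancel in a group average can be put in figures. The sketch below rests on two standard assumptions not spelled out in the passage: that the standard error of measurement of a single score is the standard deviation times the square root of one minus the reliability, and that errors for different pupils are independent, so that the error in a class mean shrinks with the square root of the class size. The standard deviation of 10 and the class of 25 are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Chance error in one pupil's score, in score units."""
    return sd * math.sqrt(1 - reliability)


def measurement_error_of_group_mean(sd, reliability, n_pupils):
    """Chance error remaining in the average of n_pupils independent scores."""
    return standard_error_of_measurement(sd, reliability) / math.sqrt(n_pupils)


# Hypothetical subtest: standard deviation 10, reliability 0.75, class of 25.
one_pupil = standard_error_of_measurement(10, 0.75)
class_mean = measurement_error_of_group_mean(10, 0.75, 25)
print(one_pupil, class_mean)
```

Under these assumptions the chance error in the class average is one fifth of that in a single pupil's score, which is why a subtest too unreliable for individual diagnosis may still serve for a group.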

A second approach to the diagnostic study of reading is through
standard oral reading passages. One test of this type that has been
used for many years is Gray's Oral Reading Passages. The test
consists of a standard set of passages, ranging from simple to
quite difficult. The child who is being studied reads the passages
aloud. The examiner uses a standard code to record on a copy of
the passages all the errors and hesitations made by the pupil.
Mispronounced words are underlined. Mispronounced vowels are shown
by appropriate diacritical marks. Omissions are encircled.
Substitutions and insertions are written in. Repetitions are indicated
by a wavy line. A sample record with a number of errors indicated
upon it is shown in Fig. 11.2.

The sun pierced into my large windows. It was the opening of
October, and the sky was of a dazzling blue. I looked out of my window
and down the street. The white houses of the long straight street were
almost painful to the eyes. The clear atmosphere allowed full play to
the sun's brightness.

[Passage shown with the examiner's handwritten error marks.]

Fig. 11.2. Example of reading passage taken from Gray's Oral Reading
Passages. (Reproduced by permission of the Public School Publishing
Co.)

The record of the child's oral responses is valuable for the insight
that it gives us into the actual process of reading. The usual
objective written test shows us only the product of a child's efforts,
the marks he makes on a test booklet or answer sheet. If he does
poorly or makes mistakes, we are often at a loss to know why. In the
oral test we can see the errors as they happen: each hesitation, each
omission, each reversal. In this way we can identify more specifically
the components that are giving the child trouble. They are not lost in
the one final result, that the child is slow in reading the passage or
does poorly on comprehension questions based on it.




The oral test as a basis for diagnosis can be illustrated in
arithmetic also by the Buswell-John Diagnostic Test for Fundamental
Processes in Arithmetic. This test consists of a series of graded
examples. The examples are to be worked out by the child "thinking
out loud," telling what he is doing and why he is doing it at each
step. The examiner has a record sheet, with a code for types of
erroneous processes. One page of the record sheet is illustrated in
Fig. 11.3. The examiner uses this form to record errors made by
the pupil as he speaks out his solution of the problem. A study of
the types of errors that the pupil is making may suggest specific
points at which the pupil needs help. This opportunity to gain
insight into the way in which the pupil is attacking the task and to
understand the nature of his errors is an advantage of oral testing
procedures in whatever field they may be used.

In a third type of diagnostic test the test maker tries to analyze
the complex task, such as reading, into its simpler components and
test these components one at a time. Thus, the Gates Reading
Diagnosis Tests include tests of recognition of words, recognition of
separate syllables, ability to blend the sounds of letter combinations,
and recognition of the single letters. The complex skill is pushed
back to smaller and smaller segments of the total task. The thought
is that when a person fails on the complex task we test to see whether
he is able to show the component skills of which the larger task is
built.

This type of approach may be illustrated in another field by the
Compass Diagnostic Arithmetic Tests. In these the authors undertake
to break up each complex skill in arithmetic into its components,
to test the simplest components first, and then to add on
additional elements until the full task has been tested. Thus, the
diagnostic test concerned with division of whole numbers has
subsections testing the child upon the following contributing skills
and understandings: (1) the vocabulary of division, (2) fundamentals
of short division, (3) short division with carrying, (4) the addition,
subtraction, and multiplication used in later subtests, (5) estimating
the first quotient figure, (6) fundamentals of long division and
checking, and (7) finding errors in long division. A study of scores
on these subsections may provide insight as to where the trouble
really lies.

Related to this type of test is the test that is loaded with
opportunities to make a particular type of error. Thus, one test used
by Gates in reading diagnosis is one in which the examinee reads a
set of words that lend themselves to reversal errors, i.e., was-saw,
on-no. Such a test gives a concentrated exposure and permits a
judgment of the susceptibility of the examinee to that particular
type of error. Informal tests of this sort in such fields as language
usage, spelling, etc., are, of course, familiar to any teacher who tries
to check upon the effectiveness of his teaching of particular usages,
rules, generalizations, and understandings.

DIAGNOSTIC CHART FOR INDIVIDUAL DIFFICULTIES
FUNDAMENTAL PROCESSES IN ARITHMETIC
Prepared by G. T. Buswell and Lenore John

Name. School. Grade. Age. IQ.
Date of Diagnosis: Add. ; Subt. ; Mult. ; Div.
Teacher's preliminary diagnosis

ADDITION: (Place a check before each habit observed in the pupil's work)

a1 Errors in combinations
a2 Counting
a3 Added carried number last
a4 Forgot to add carried number
a5 Repeated work after partly done
a6 Added carried number irregularly
a7 Wrote number to be carried
a8 Irregular procedure in column
a9 Carried wrong number
a10 Grouped two or more numbers
a11 Splits numbers into parts
a12 Used wrong fundamental operation
a13 Lost place in column
a14 Depended on visualization
a15 Disregarded column position
a16 Omitted one or more digits
a17 Errors in reading numbers
a18 Dropped back one or more tens
a19 Derived unknown combination from familiar one
a20 Disregarded one column
a21 Error in writing answer
a22 Skipped one or more decades
a23 Carrying when there was nothing to carry
a24 Used scratch paper
a25 Added in pairs, giving last sum as answer
a26 Added same digit in two columns
a27 Wrote carried number in answer
a28 Added same number twice

Habits not listed above.

[The record sheet continues with numbered addition examples and space
for the examiner's notes on the pupil's work.]

Fig. 11.3. Example of record sheet used in the Buswell-John Diagnostic
Test for Fundamental Processes in Arithmetic. (Reproduced by
permission of the Public School Publishing Co.)

Finally, diagnostic testing in any given field must go beyond the 
immediate field of skill or knowledge and seek information on all 
the background factors that contribute to success or difficulty in 
the particular area. Thus, to understand the child with reading 
difficulty we need information on his vision, his hearing, his general 
intellectual level, even his interests and his emotional adjustment. 
So a thorough diagnostic study will include tests of visual acuity, 
muscular balance and fusion, testing with an audiometer to be sure 
the child can hear adequately, a non-reading intelligence test, and 
interview or questionnaire information about factors in the child's 
background and present life that may prove relevant. Diagnostic 
testing spreads out beyond subject boundaries and a full diagnostic 
study becomes essentially a directed case history of an individual, 
directed in that it is focused on the academic problem but compre- 
hensive in that it covers all potentially significant features of both 
the skill area and the individual's personal life. 

We have described a variety of diagnostic procedures in reading 
and in arithmetic. It is in these fields that the most work on diag- 
nostic procedures has been done. There are, in fact, few published 
diagnostic tests outside these fields, though there is certainly informal 
teacher diagnosis. Even in the fields of reading and arithmetic, 
relatively little information about the specific diagnostic tests is pro- 
vided by the authors. Evidence on reliability is meager, and norms 
are rather crude and fragmentary. Most diagnostic tests are not 
very elegant psychometric devices. They have not been sufficiently 
widely used to support the large investment in development and 
analysis that characterizes the more popular survey tests. Interpre- 
tation of test scores must, therefore, be made with particular care 
and a good deal of tentativeness. 


ACHIEVEMENT MEASURED THROUGH PUPIL 
PRODUCTS 


One type of achievement measure that cannot be well illustrated
within the field of reading is the product scale. We can illustrate
this type of appraisal in the field of handwriting. Here, the plan
is to evaluate some performance of an individual, in this case his
handwriting, by comparing it with a set of standard samples. The
standard samples are chosen by using the pooled judgments of a
number of judges. The judges are usually asked to consider specimens
in pairs and decide which is better. The basic idea in this
type of scaling is that the larger the per cent of judges who agree in
noticing a difference, the larger the difference. Thus, if 90 per cent
of judges consider specimen A to be better than specimen B and
only 80 per cent consider B to be better than C, the difference
between A and B is greater than the difference between B and C.
If 50 per cent consider C better than D and 50 per cent consider
D better than C, then C and D must be considered of equal merit.
Equally perceptible differences are considered to be the same size.
Thus, a difference that is agreed upon by 75 per cent of our group of
judges would be considered to be the same size wherever it occurred
on our scale. Basing our scale units on this ease-of-perception
standard, we can set up a scale of specimens from very poor to very
good and assign a numerical value to each.
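
One common way of turning these percentages into scale distances can be sketched as follows. The passage says only that a larger per cent of agreement means a larger difference; the conversion used here, which treats each per cent as an area under the normal curve and takes the corresponding standard-score distance, is an assumption on our part and is not necessarily the procedure used for any particular published scale. The function name is likewise our own.

```python
from statistics import NormalDist


def scale_separation(proportion_preferring):
    """Distance between two specimens, in normal-deviate units.

    proportion_preferring: fraction of judges who rated the first specimen
    better than the second (must lie strictly between 0 and 1).
    """
    return NormalDist().inv_cdf(proportion_preferring)


a_over_b = scale_separation(0.90)  # 90 per cent judge A better than B
b_over_c = scale_separation(0.80)  # 80 per cent judge B better than C
c_over_d = scale_separation(0.50)  # judges evenly divided between C and D

# Build scale values upward from D, taken arbitrarily as zero.
d_value = 0.0
c_value = d_value + c_over_d
b_value = c_value + b_over_c
a_value = b_value + a_over_b
print(round(a_over_b, 2), round(b_over_c, 2), round(c_over_d, 2))
```

On this assumption the step from B to A (about 1.28 units) is larger than the step from C to B (about 0.84 units), and C and D fall at the same point, in agreement with the reasoning above. A unanimous judgment gives no finite distance and would have to be handled separately.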

When we use a product scale, the procedure is to compare the
specimen of a pupil's performance with the set of standard samples.
His product is moved up and down the set of standard samples until
the judge decides which one it most nearly resembles. It is then
assigned the scale value of the one that it matches most closely.
If greater accuracy is desired, each sample may be compared to the
set of standard samples by two or more judges independently and
their judgments averaged to give the final value.

Product scales have been used for such performances as
handwriting, sewing, drawing, and manual arts. They are potentially
applicable to any area of skill in which a permanent tangible product
is the end result.


ACHIEVEMENT TEST BATTERIES 


Perhaps the most widespread program of achievement testing is
that based upon survey achievement test batteries. In the following
discussion, we will make a number of comparisons of the following
batteries, with publishers as indicated:

California Achievement Tests, California Test Service
Coordinated Scales of Attainment, Educational Test Bureau
Iowa Every Pupil Tests of Basic Skills, Science Research Associates
Metropolitan Achievement Tests, World Book Company
National Achievement Tests, Acorn Publishing Company
Stanford Achievement Tests, World Book Company




These batteries represent "package" achievement testing programs 
ready-made for the schools’ use. The typical battery is made up 
of from four to eight or ten separate tests covering the core knowledge
and skill segments of the curriculum. We shall examine the content
of several batteries in more detail presently. The attempt of the
authors and publishers is to produce an integrated instrument that
will cover the general achievement testing needs of the typical
community.

The chief virtues of the single battery of tests, as compared with 
a program made up of separate tests chosen from a variety of differ- 
ent sources, are those of unity and of convenience. A test battery 
is unified in two important respects. In the first place, it is based 
upon a unified and integrated plan. The parts have been selected 
and the content of each planned with an eye to the whole. Within
the limits of the professional skill and understanding of the team of 
authors, the product is a unified whole in which the parts fit together 
to cover the range of objectives that they deem important and feas- 
ible to appraise with a standardized test. 

A battery is unified in one other important respect. It has a unified 
set of norms. The norms for all the subtests are based upon the 
same population and expressed in the same form. This makes direct 
comparison between the different subtests possible with a minimum 
of question. We do not have to ask whether our reading test was 
tried out on the same type of group as our arithmetic test, or how 
the standard scores of our spelling test compare with the percentile 
equivalents of our language usage measure. When tests are assem- 
bled from different sources, these problems can be matters of real 
concern. Particularly in the past, when norming populations for 
tests were assembled in a somewhat haphazard manner, the com- 
parability of a grade score of 4.0, for example, from one test to 
another was subject to serious question. The large, broadly rep- 
resentative groups used in norming recent achievement batteries as- 
sure both breadth of representation for the norms as a whole and 
equivalence of meaning from one test to another. 

Of course, the "package" testing program based on a standard 
battery has certain limitations. The chief one is rigidity. Some 
sections of a battery may fit a particular local curriculum better than
others. Some subtests of one battery may fit modern curricular
objectives better in one area, whereas another battery may seem
better in another area. The user of the battery gets the good with
the bad, "the bitter with the sweet." Short of omitting certain
sections completely, he must use what the battery offers him, even
though in certain respects it may not fit his needs, as he sees them,
as well as some other specific test covering that area. How serious
this is the consumer must judge for himself when he compares the
subtests of the battery that he is using or proposes to use test by
test with other tests that are available for measurement in those
same areas. The general verdict of users, particularly in the
elementary school, has been that the convenient and unified program
represented in a survey battery has more advantages than drawbacks,
and in practice such instruments are very widely used.


CONTENT OF ELEMENTARY-SCHOOL BATTERIES 

When we examine the widely used batteries prepared for the
elementary school, we find a good deal of similarity in general plan.
In particular, they uniformly include a test of word knowledge and
one of reading comprehension; they yield a score for arithmetic
fundamentals and one for arithmetic reasoning; they have a spelling
test and a section on language usage. Thus, they all cover this common
ground of basic skills. They provide for combining the vocabulary
and comprehension tests into a single reading score and for combining
the computation and reasoning tests into a single arithmetic
score. They have some technique for pooling all the parts into a
total achievement score. And, of course, they provide conversions
for translating scores on the parts and total into some common norm
system. A grade equivalent is most common for the elementary-school
tests, whereas some type of percentile or standard score is
generally provided for senior high school tests.

Difference in Content. However, there are also many aspects of
variation. The first of these is the inclusion or non-inclusion of
subtests in the content areas. In the primary grades, practically all
the different batteries are restricted to the basic skill subjects of
reading, arithmetic, spelling, and language. But starting at the
fourth- or fifth-grade level a number add on content material. The
variety of content subtests is shown in the tabulation below.


Battery             Literature  History  Geog.  Science  Soc. St.  Health
California          *
Coordinated Scales  x x x x
Iowa Every Pupil    *
Metropolitan        x x x x
National            ss x x x
Stanford            x x


* No formal subtests devoted to any of these content areas. 




The tests that do include content subtests can usually also be 
obtained in an abbreviated form, limited to the skill subjects. Ap- 
praisal of the content areas seems to many schools less urgent than 
appraisal of the basic skills that are tools for learning and
communicating in a variety of areas. The content areas are also likely
to vary most in what is taught and in the objectives of teaching from one
school system to another. Thus, it is very difficult to guarantee that 
a sample of 30 or 40 items dealing with facts or characters from lit- 
erature will appropriately sample the literary experiences that have 
been provided in a particular school system. The content subtests 
need particularly close scrutiny to determine whether they corre- 
spond satisfactorily with the objectives of local instruction. 

A second significant variation is whether the tests do or do not 
make provision for testing such work-study skills as ability to use 
an index, knowledge about different reference sources, and ability to 
read maps, graphs, and charts. The tests range from giving almost 
no attention to these skills (Metropolitan, National, Coordinated 
Scales) through having short subtests included in the general reading 
comprehension area (California) to providing one or more tests ex- 
plicitly devoted to these techniques for getting information (Stan- 
ford, Iowa Every Pupil). The decision to include these skills is based 
upon a recognition of the role of independent information-gathering
techniques in the repertory of the pupil. An educational program
that emphasizes pupil-directed activities, individualization of
instruction, and diversified activities by different pupils values the
pupil's ability to get and interpret information from various sources.
A school system that values such pupil activities may wish to include
these skills among those that are separately appraised in its
measurement program.

Though the different batteries all include the same basic core of
skill subjects, they differ fairly substantially in the allocation of
time to the different aspects of the basic skills. The variation can
be illustrated by a little analysis of the time allocation among the
six skill subtests for levels of the batteries appropriate for use in the
sixth grade. The analysis is presented in tabular form.


Battery           Word Knowl.  Reading Comp.  Arith. Fundamentals  Arith. Reasoning  Spelling  Lang. Usage
California        10%  19%  37%  14%  10%  10%
Iowa Every Pupil  6  30  24  [?]  [?]  [?]
Metropolitan      [?]  16  36  26  10  16
National          -  22  22  22  11  22
Stanford          9  18  25  11  12




The variation in time allowance is in some cases fairly marked.
Thus, reading comprehension varies from 16 to 30 per cent of the 
total testing time, arithmetic fundamentals from 22 to 37 per cent, 
arithmetic reasoning from 9 to 26 per cent, and language usage from 
10 to 24 per cent. 
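
The percentages in such a tabulation are simply each subtest's share of total working time. As a small sketch of the arithmetic, the minutes below are hypothetical and do not describe any of the batteries named above; the function name is ours.

```python
def time_shares(minutes):
    """Return each subtest's share of total testing time, in whole per cents."""
    total = sum(minutes.values())
    return {name: round(100 * m / total) for name, m in minutes.items()}


# Hypothetical working times (in minutes) for the six skill subtests.
example_minutes = {
    "Word Knowledge": 10,
    "Reading Comprehension": 20,
    "Arithmetic Fundamentals": 30,
    "Arithmetic Reasoning": 15,
    "Spelling": 10,
    "Language Usage": 15,
}

shares = time_shares(example_minutes)
for name, pct in shares.items():
    print(f"{name}: {pct} per cent")
```

Because each share is rounded separately, the per cents for a real battery may add to 99 or 101 rather than exactly 100.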

Differences in Degree of Subdivision. Another respect in which the 
tests differ is in the degree to which the skill areas are subdivided 
into shorter subtests yielding separate scores. A number of the 
tests (Metropolitan, Stanford, National, Coordinated Scales) are con- 
tent to deal only with total score on the six tests in the skill areas. 
No further analysis is provided for, and the user is encouraged to 
combine specific tests, i.e., the two reading tests and the two
arithmetic tests, into broader totals for interpretation. Other tests, by
contrast, provide for a number of subscores based upon very short 
tests. Thus, the California Achievement Test at the elementary level 
provides for the reporting of 18 separate scores. Some of these scores 
are based on as few as 10 items or as little as 3 minutes of testing
time. What are the pros and cons of such a fractionation of the
test?

The alleged value of an extensive set of subscores is that these
provide a maximum of diagnostic information about strengths and
weaknesses. And it is true that the narrower the segment of skill
that we study, the more specific our information will be about what
the individual is able to do. But the fractionation is inevitably
accompanied by shortening of the separate parts, and this shortening
inevitably leads to lowered reliability. We get low reliability in a
would-be diagnostic measure just where we stand most in need of
high reliability. (See the discussion on pp. 178-181.) The result is
that we create a test profile with a number of ups and downs, many
of which are quite without significance. By showing all this detail,
we tempt the teacher to try to interpret the differences, and since
many of them are chance matters the attempt to interpret them is
a waste of time and possibly a source of misguided remedial efforts.

There is one way in which a set of short and relatively diagnostic
test scores may perhaps be used to advantage. This is for diagnosis
on a class rather than an individual basis. The class averages on
the various subtests may provide a useful picture of general strengths
and weaknesses. In the pooled results for the group, chance errors
of measurement tend to cancel out. Even a relatively unreliable
test is adequate to bring out group differences. The profile
representing the average for the class or for all the classes in a given
grade may indicate to teacher or administrator relatively specific
strengths or weaknesses for the group as a whole and provide cues
for needed class activities. It is in group rather than individual
diagnosis that relatively brief and unreliable subtest scores appear
to have some function.
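
Both points can be illustrated with a little arithmetic. The sketch below assumes the Spearman-Brown relation between test length and reliability and the usual standard error of measurement; neither formula is stated in this passage, and every number (a 60-item test with reliability .90, a standard deviation of 10, a class of 30) is hypothetical.

```python
import math


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Reliability expected when a test is lengthened or shortened.

    length_factor is new length / old length (e.g. 10 items from 60 is 1/6).
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Typical size of chance error in one pupil's score."""
    return sd * math.sqrt(1 - reliability)


def error_of_class_mean(sd: float, reliability: float, class_size: int) -> float:
    """Chance error remaining in a class average, assuming independent errors."""
    return standard_error_of_measurement(sd, reliability) / math.sqrt(class_size)


# Hypothetical figures: cut a 60-item test (reliability .90) to a 10-item subtest.
short_reliability = spearman_brown(0.90, 10 / 60)

# Chance error for one pupil versus a class of 30, with a score SD of 10.
pupil_error = standard_error_of_measurement(10, short_reliability)
class_error = error_of_class_mean(10, short_reliability, 30)

print(round(short_reliability, 2), round(pupil_error, 1), round(class_error, 1))
```

On these assumptions the short subtest is much less reliable than the full test, yet the chance error in the class average is only a fraction of the error in a single pupil's score, which is why the short scores may still serve for group diagnosis.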

Differences in Item Form. Within the corresponding subtests of 
the different achievement batteries, there is a good deal of variation 
in the specific task presented to the examinee. The variation can be 
illustrated in relation to the spelling subtest, in which it is perhaps 
most marked. The manner in which the spelling task is presented 
to the examinee is indicated below. 


Battery               Form of Spelling Exercise

California            Identifying the misspelled word in a set of four
                      different words.

Coordinated Scales    Identifying the correct one among four variant
                      spellings of a word.

Iowa Every Pupil      Identifying the correct one among four variant
                      spellings of a word.

Metropolitan          Writing the word from dictation in a sentence.

National              Writing the word from dictation in a sentence.

Stanford              Identifying the correct one among three variant
                      spellings.


The potential user will need to examine the types of tasks presented 
by each subtest of a battery he proposes to use, comparing the test 
with its competitors and judging which type of item best represents 
the outcomes that he is trying to achieve. 


BATTERIES FOR HIGH-SCHOOL ACHIEVEMENT 


The batteries that have been discussed so far are for the elementary 
school and junior high school. It is at these levels that survey bat- 
teries have been most widely used. The more departmentalized and 
specialized program of the high school and college appears to call 
more for specific tests in particular subject areas. The Cooperative 
Test Division of the Educational Testing Service markets a range of 
such specific achievement tests all tied to a common score scale. 

There are, however, several comprehensive batteries at the sec- 
ondary and higher levels. Summary information on four of these is 
presented in Appendix II. The four batteries suitable for
secondary-school use have in common tests of content knowledge in natural
sciences, social studies, and mathematics. Three of them also provide
an evaluation of achievement in English, tending to emphasize
correctness and effectiveness of expression. The Iowa Tests of
Educational Development go beyond the fields of content knowledge and
undertake to appraise abilities to locate, read, and understand
materials in the different subject areas, thus attempting to test ability
to get and use knowledge as well as the amount of knowledge already
obtained. Tests of this sort were found especially useful in evaluating
the educational level of individuals much of whose education had
occurred outside the usual school setting, specifically soldiers in
World War II who had acquired various amounts and types of
training while in military service. Evidence is reported by the test
authors that score on this battery predicts college achievement at
least as well as grades during 4 years of high school.


USING THE RESULTS OF A SURVEY BATTERY 


Since the survey achievement battery is one of the two or three 
most widely used types of standardized test, it is fitting that we 
consider ways in which the results from this testing may be used and 
appraise the soundness of each. Various things are done with the 
results from achievement testing, some useful, some relatively futile, 
and some perhaps positively harmful. Let us examine some of the 
possibilities. One possibility, of course, is that the tests are just 
given, scored, incorporated in some type of summarizing report, and 
filed away. This is one of the forms of futility referred to above. 
We shall dismiss this possibility and assume that at least something 
will be done with the test results. Let us examine various uses that
might be made of them.


USE TO EVALUATE THE CURRICULUM OF SCHOOL OR SCHOOL SYSTEM 

As part of a total appraisal of the effectiveness of its program, a 
school system may well wish to include measures of progress in basic 
skills. An achievement battery provides a convenient tool for doing
this. The results will show how well the particular school or school 
system has progressed on the several components of the battery in 
relation to the norming groups. However, in interpreting this prog- 
ress, three cautions must be borne in mind. 


1. The evaluation is only partial, not complete. The battery can 
give information only on the range of skills that it covers, and these 
skills represent only a fraction of the objectives of the modern school. 
Because they are so conveniently measurable, they may become over- 
valued. This is an insidious danger. The school system must seek 
to supplement standardized achievement tests with broader and
more informal appraisals of other objectives if it is to obtain a
well-rounded evaluation of its program.




2. Local emphases may differ from those that characterized the 
national sample. The particular school system may have placed 
heavier emphasis upon reading or may have delayed the introduction 
of formal instruction in arithmetic. In so far as local emphasis and 
effort are atypical, local accomplishment may be expected to be 
atypical. Evaluation of achievement in the single school or system 
must take account of distinctive local emphases. 

3. Evaluation of pupil performance in a school must take account 
of the characteristics of the pupil population. Schools, communi- 
ties, even regions differ in the economic and cultural level of the 
population served. Associated with these differences are differences 
in average level of ability as measured by our intelligence tests. 
The expectancy for achievement must be tempered to take these 
factors into account. 


USE TO PLAN THE PROGRAM FOR A CLASS GROUP AND THE PUPILS IN IT 


Every fall each teacher in most schools faces a new group of 
pupils. Within the limits set by the course of study (which may in 
some instances be quite rigid limits) he must plan a program of 
activities for the group as a whole and must adapt that program as 
best he can to each of the children in the group. He must decide 
where to pick up the various skill subjects, how much time to devote 
to review of materials presumably taught in the previous year, and 
how fast to move ahead. He must plan appropriate enrichment ex- 
periences and materials for independent work and free time. He will 
probably want to form informal groupings within the class for work 
together at a common level. 

To do these things he needs to get to know the pupils in the group 
as quickly, thoroughly, and accurately as possible. Administration 
of a standard achievement battery is an efficient way of laying the 
foundations for that picture of the class which will permit him to 
adapt his plans to the individuals with whom he has to deal. The 
scores will provide a guide as to whether the group as a whole is
superior, average, or retarded in each of the basic skills he is trying
to develop. They may indicate group areas of relative strength and 
weakness. They will pick out the children who could profit from 
more challenging tasks than those presented to the class as a whole, 
those who need less demanding materials, and those who should be 
considered for special help either within the classroom or through a
remedial teacher if one is available. 

It should be understood that this function of informing the teacher 
about his pupils is not to depend on tests alone. Every contact
with the children helps the teacher to get a "feel" for the class group
and the pupils in it. A richness of understanding of individual
pupils can only come from working with them as persons. But the
set of standard test scores provides an objective reference framework
within which to see the rest of the picture of the class and the pupils.
This function can, of course, be served by tests given the preceding
spring and forwarded to the teacher when he meets the class in the
fall. Technically, test results from spring testing would be quite
serviceable, since pupils' skills are not likely to shift around greatly
during the few summer months. But one suspects that tests given
early in the fall will seem more current and alive and will be more
likely to be used by the teacher in determining his plans for the class.


USE TO IDENTIFY INDIVIDUALS FOR MORE DETAILED STUDY 

One function of an achievement battery is to help screen out a 
fraction of the group of children for more intensive study. Though 
every child should be studied as an individual, there are in every 
school system some children more in need of special help than others. 
In those cases in which the symptom is failure to progress in school
skills, the problem may be first identified by poor performance on a
standardized test.

Gross irregularities in performance on different subtests, performance
far below his age or grade level, or performance far below his
aptitude as indicated by an intelligence test are cues suggesting
further study. But they are only cues. They are only symptoms
suggesting that something may be wrong. The significance of the
symptom must be investigated further. In the first place, the educational
achievement must be related to a measure of aptitude to see that
the child is falling behind what should be expected of him. Where
it is reading achievement that is at issue, his achievement should be
related to performance on an aptitude test not involving reading.
Then if the deficiency appears to be a specific retardation in some
school skill, further diagnostic procedures need to be applied to
determine the exact nature and causes of the deficiency.
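
One reason such cues are only cues is that a difference between two scores is usually less reliable than either score alone. The sketch below assumes the standard formula for the reliability of a difference between two scores expressed on the same scale with equal variability; the formula is not given in this passage, and the reliabilities and correlation used are hypothetical.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference between two equally variable scores.

    r_xx, r_yy: reliabilities of the two tests.
    r_xy: correlation between the two tests.
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Hypothetical figures: two subtests, each with reliability .80, correlating .60.
reading_vs_arithmetic = difference_reliability(0.80, 0.80, 0.60)
print(round(reading_vs_arithmetic, 2))
```

With these illustrative figures the difference score is markedly less reliable than either subtest, so an apparent gap between a pupil's two scores calls for confirmation before it is acted upon.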


UNDERSTANDING THE INDIVIDUAL PUPIL 

Though special study and remedial activities may be possible for
only part of the children in a class, the school and teacher have the
responsibility of knowing every child as well as possible so as to
provide the best possible guidance for him in his present school
activities and in plans for the future. Level of educational achievement
is one facet of the picture that is needed in understanding and
guiding each pupil. Appraisal of present adjustment, planning for future
education, and counseling about a life career can all be helped by 
information about educational progress. 


MAKING UP CLASS GROUPS 

In a large school where there are enough pupils to fill several 
classes in a grade or several sections in a subject, some procedure 
must be adopted for assigning children to particular groups. There 
has been extended controversy as to whether and to what extent it 
is desirable to group together in a single class children of similar 
ability. We cannot examine here the issues involved in that debate. 
Some consideration was given to that problem in Chapter 9. However,
if the basic decision has been made to try to achieve homogeneous
groups within a single classroom, average level of achievement
on a standardized achievement battery provides one type of
evidence that may be used in such grouping. If over-all level of 
achievement is considered together with aptitude, a considerable re- 
duction in the heterogeneity of the group can be produced. Perfect 
homogeneity will not be obtained (nor should it be desired) because 
of the irregularity of the separate accomplishments of children all of 
whom have the same total achievement, but the variability within
the single class can be somewhat reduced.
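
The reduction in variability can be seen in a small sketch. It groups on a single achievement score only, whereas the text speaks of considering achievement together with aptitude; the twelve grade equivalents and the function name are hypothetical.

```python
from statistics import pstdev


def section_spreads(scores, n_sections):
    """Rank pupils by score, cut into sections, and return each section's SD."""
    ranked = sorted(scores)
    size = -(-len(ranked) // n_sections)  # ceiling division
    sections = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [pstdev(section) for section in sections]


# Hypothetical total-achievement grade equivalents for twelve pupils.
scores = [3.1, 3.8, 4.0, 4.2, 4.5, 4.9, 5.0, 5.3, 5.6, 6.0, 6.4, 7.2]

whole_spread = pstdev(scores)
spreads = section_spreads(scores, 3)

print(round(whole_spread, 2), [round(s, 2) for s in spreads])
```

Each section is less variable on the grouping score than the grade as a whole, but the pupils within a section will still differ from one another on the separate subtests.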


EVALUATING THE TEACHER 


It is reported that in some school systems a standardized achieve- 
ment battery is used, either openly or covertly, to evaluate the suc- 
cess of the teacher. He is judged by the performance his class shows 
on standardized tests given at the end of the school year. He is
expected, with varying degrees of unrealism, to bring his class "up 
to the norm" on these tests. 

This procedure seems questionable at best, and quite possibly 
vicious. It fails to take account of a number of important considera- 
tions. In the first place, the achievement of a class group is a func- 
tion of their whole previous educational history, not merely of the 
year just past. It is unreasonable to hold the teacher who has taught 
a group for a single year solely responsible for their present status. 
In the second place, achievement depends upon aptitude and upon 
out-of-school cultural experiences as well as upon schooling. Unless 
the evaluator is prepared to make an appropriate adjustment for 
the intellectual and socio-economic level of a particular class (and
class groups can differ widely in these respects), no reasonable
baseline can be provided for evaluating what the teacher has
accomplished. In the third place, the skills measured by an achievement
battery represent only a fraction of the objectives of a modern school. 
Comparison of teachers with respect to this partial criterion neglects 
much of their work and may provide a very unfair evaluation of 
relative worth of two teachers whose strengths lie in different direc- 
tions. Fourth, placing a premium upon easily testable skills in 
evaluating the teacher is almost inevitably going to lead the teacher 
to overvalue those skills in his teaching. As he is judged, so will 
he judge. Skills will tend to become the one central theme of his 
teaching, at the expense of all the other outcomes the school is trying
to achieve. He will, with varying degrees of directness, teach
for the tests. Finally, one may mention the demoralizing effect upon
teachers of a mechanical, external evaluation that is subject to all
the technical limitations discussed above.


SUMMARY STATEMENT 


The typical standardized achievement test is superficially much like 
an objective test made by the classroom teacher. However, it is 
based on large segments of knowledge or skill common to the programs
of many schools, and it provides norms. These features
mean that it is appropriately used in making broad comparisons:
between schools or classes, between areas of achievement, or
between achievement and aptitude.

Just as an analysis of the objectives to be measured was indicated
as the first step in thoughtful construction of a classroom test, so an
analysis of objectives is a prerequisite for evaluating a published
test. The test can only be evaluated in terms of the objectives that
the teacher or school is trying to achieve.

Most widely used standardized tests are survey tests, giving a
general appraisal of level of accomplishment in a broad area. If the
teacher is to work constructively with the pupil, such survey results
need to be supplemented by more specific and diagnostic information.
Some published diagnostic tests exist, and these can be supplemented
by informal teacher appraisals. However, the reliability of difference
scores and consequently of differential diagnoses is often low.
Diagnostic clues should be considered quite tentative.

Certain skills, such as those of handwriting, shop work, or domestic
arts, can be appraised effectively by comparing a pupil product with
a scaled set of standard samples.

Standardized achievement test batteries are very popular for school
use. In these the advantage of unity in plan and standardization
must be weighed against the inflexibility of a single total battery.
The published batteries are similar in general design, though they 
differ in (1) content subjects included, (2) emphasis on work-study 
skills, (3) balance of emphasis among different areas, and (4) specific 
pattern of items in each field. 

When used with discretion and proper reservations, a standardized 
achievement battery can serve a useful purpose as one type of evi- 
dence (1) to evaluate a school's educational program and its several 
components, (2) to help the teacher plan the work of his class and 
the grouping of pupils within it, and (3) to provide an understanding 
of the individual pupil. Standardized test results should rarely, if 
ever, be used as a basis for evaluating the effectiveness of specific 
teachers. 


SUGGESTED ADDITIONAL READING 


Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chap- 
ters 17 and 18. 

Hildreth, Gertrude H., Metropolitan achievement tests: manual for interpret- 
ing, Yonkers, New York, World Book, 1948. 

Monroe, Walter S., Editor, Encyclopedia of educational research, New York,
Macmillan, 1950, pp. 874-894, 1461-1478. 

Traxler, Arthur E., et al., Introduction to testing and the use of test results in 
public schools, New York, Harper, 1953, pp. 89-95. 

Traxler, Arthur E., The use of test results in diagnosis and instruction in the 
tool subjects, rev. ed., Educational Records Bulletin No. 18, New York, Edu- 
cational Records Bureau, 1949. 


QUESTIONS FOR DISCUSSION 


1. For which of the following purposes would a standardized test be useful?
For which should a teacher expect to make his own test? Why? 


a. To determine which pupils have mastered the addition and subtraction 
of fractions. 

b. To determine which pupils in a class are below standard in arithmetic
computation. 

c. To determine the subjects in which each pupil in a class is strongest and 
weakest. 

d. To determine for a class which punctuation and capitalization skills need 
further teaching. 

e. To form subgroups in a class for the teaching of reading.


2. Examine some standardized reading test. In view of the tasks it pre- 
sents, which of the objectives outlined on pp. 272-273 does it measure ade- 
quately? Which does it measure to some extent? Which does it fail to measure 
at all? 




3. Examine a standardized achievement test for a subject that you are 
teaching or plan to teach. Which of the objectives that are important in the 
subject are measured adequately by the test? Which ones are not? 

4. Make a critical comparison of two achievement test batteries for the 
same grade. How do they differ? What are the advantages of each from 
your point of view? 

5. What are the advantages and disadvantages of the different types of
item forms for spelling subtests listed on p. 290? 

6. Suppose you are teaching mathematics in the first year of junior high 
school. List the steps you would take to diagnose the achievement level of 
the pupils and plan for remedial instruction. 

7. The manual of test W states that it can be used for diagnostic purposes. 
What should you look for to determine whether it has any real value as a 
diagnostic aid? 

8. Why should we be specially concerned about the reliability of the scores
resulting from a set of diagnostic tests? What implications does this have
for using and interpreting such tests?

9. Suppose that you are a college chemistry teacher and are interested in
the laboratory skills of glass blowing that your students have developed.
How might you develop a product scale for evaluating their skill?

10. Before you can make a sound evaluation of the grade equivalents made
on a battery of achievement tests by a class or pupil, what information do
you need beside the converted scores themselves?

11. The town of M gives the Stanford Achievement Tests to pupils in grades
4 and 6 and records on the cumulative record card only the grade equivalent
for the whole test. What are the disadvantages of this type of record?

12. You have given a standardized achievement battery in October to
your fourth-grade class. What might you, as teacher, do on the basis of the
results?

13. In city K, the Metropolitan Achievement Test is given to all schools in
April. The average grade level for each class group and for each subject is
reported to the superintendent of schools' office, and these results are
mimeographed and distributed to all schools. What are the gains from this
procedure, and what are the dangers in it? What changes would you suggest?

14. In a fourth-grade group you have data from a group intelligence test 
and from an achievement test battery. On what basis would you select indi- 
viduals to receive special remedial work, either in your class or with a special 
teacher? What are the hazards of this procedure? 

15. What should be the role of standardized test results in evaluating the 
performance of the classroom teacher? 


Chapter 12 


Behavioral Measures of 
Personality 


The past three chapters have been devoted to measures of ability: 
what the individual can do under test conditions and motivation to 
do his best. We shall move on now to measurement of other aspects 
of personality—to the appraisal of what he will do under the natural 
circumstances of life. Both in our discussions of personality and in 
our efforts to develop instruments of appraisal, we must recognize 
that the person is a unified whole. Any aspects or traits that we 
may separate out are separated out for our convenience. They do 
not exist as separate entities. They are only aspects of or ways of 
looking at the unitary person. However, it is inevitable that we do 
pick the person to pieces to study and understand him. We cannot 
look at everything at once. 

In Chapter 2 we identified five segments of personality; to wit: 

Temperament refers to the individual's characteristic mood, activity 
level, excitability, and focus of concern. It includes such dimensions 
as cheerful-gloomy, energetic-lethargic, excited-calm, introverted- 
extroverted, and dominant-submissive. 


Character relates to those traits to which definite social value is 
attached. They are the “Boy Scout" traits of honesty, kindliness, 
cooperation, industry, and such. 

Adjustment is a term that we shall use to indicate how well the 
individual has been able to make peace with himself and the world 
about him. In so far as the individual can comfortably accept him- 
self and his world, in so far as his ways of life do not get him into 
trouble in his social group, he will be considered well adjusted. 

Interests refer to tendencies to seek out and participate in certain 
activities. 

Attitudes relate to tendencies to accept or reject particular groups 
of individuals, sets of ideas, or social institutions. 



METHODS OF STUDYING PERSONALITY 


Most of the evaluation techniques we shall consider in this and the 
following chapters have to do with one or more of the aspects of 
personality identified above. To what sources may we go for evi- 
dence on these aspects when we wish to study an individual? First, 
we can see what he does, how he behaves in the real world of things 
and people. Second, we can find out what others say about him. 
Third, we can listen to what he says about himself. Fourth, we can 
observe how he reacts to the world of fantasy and make-believe. 


MEASURES OF BEHAVIOR 

This chapter will be devoted to a discussion of techniques for 
assessing the individual's typical behavior. We shall first consider 
test techniques that result in some product or permanent record and 
yield an objective score. The tests are necessarily disguised, so 
that the individual either does not know he is being tested or does 
not know what aspect of his behavior is being observed. This 
concealment is necessary if the individual is to show his typical behavior 
rather than his best behavior. Then we shall go on to consider 
procedures for observing the subject and for recording his responses as 
they are seen by an observer. 


MEASURES OF THE OPINION OF OTHERS 

For some purposes, we may be interested in how a person is perceived 
by his fellow beings. Is he seen as a friendly fellow worker? 
A fair teacher? An industrious pupil? A convincing salesman? A 
generally desirable employee? The opinion of others may be the 
significant fact in certain settings. It is also a very convenient way of 
getting a summary appraisal of a fellow man. For these reasons, 
rating procedures have been widely used. We shall consider their 
values and limitations in the next chapter. 


WHAT THE INDIVIDUAL SAYS ABOUT HIMSELF 


The procedures so far described are all "external." They characterize 
the individual in terms of his public actions or in terms of the 
reactions of others to him. To get at the inner self, we must go to 
the individual himself. One way is to interview him. Another is to 
incorporate the questions that might be asked in a face-to-face 
interview into a uniform questionnaire or "personality inventory." The 
choices the individual makes in responding to the set of questions 
are scored in various ways to provide a picture of him as he reports 
himself. The strengths and weaknesses of these procedures will be 
reviewed in Chapter 14. 


THE WORLD OF IMAGINATION AND FANTASY 


What an individual will tell about himself in response to questions 
is limited by his willingness to reveal himself, his understanding of 
himself, and his understanding of the language in which the ques- 
tions are presented. For this reason, indirect methods have been 
sought to avoid these limitations and permit him to "open up" more 
fully. One indirect avenue is that of fantasy, imagination, and make- 
believe. We may study what the person sees in ink blots, what 
stories he tells about an ambiguous picture, what play scenes he acts 
out with dolls, what he does with paints and modeling clay. These 
materials and others have been used to elicit imaginative productions 
that psychologists have studied as a source of understanding of chil- 
dren and adults. The individual is allowed to express himself through 
play materials or to project his own interpretations into ambiguous 
stimuli, and thus to reveal himself to us. These are expressive and 
projective techniques for personality appraisal. We shall undertake 
to describe and evaluate them in Chapter 15. 


BEHAVIOR TESTS OF CHARACTER 


The field of character testing is one in which disguised behavior 
tests of the situational type received early emphasis. This was 
natural. Traits of character relate to behaviors in which society sets 
up definitions of what is "good" and what is "bad." We can hardly 
expect a child to report his dishonesties, for example, or to show 
them in a test situation in which he knows his honesty is being ob- 
served and appraised. Furthermore, he has probably managed to 
conceal most of his transgressions from teacher, camp counselor, 
or other adult who might be asked to rate him. We are almost forced 
back upon a concealed test to elicit such socially disapproved 
behavior. We shall describe in some detail the honesty tests devised 
by May and Hartshorne for the Character Education Inquiry, in 
part for their intrinsic interest and in part because they illustrate 
the virtues and many of the limitations of this type of measurement 
procedure. 

May and Hartshorne developed a comprehensive series of tests of 
honesty. These included situations in which the individual had a 
chance to cheat, situations in which he had an opportunity to lie, 
and situations in which it was possible for him to steal. Some of 
the situations are described below. 

Situation A: Cheating on a test by copying. A test is given dealing 
with some topic related to school work, word knowledge, for example. 
The papers are collected. The next day the papers are passed out, 
and each pupil is allowed to score his own paper when the answers 
are read aloud. As a matter of fact, however, the papers have been 
accurately scored before they are returned without any marks being 
made on the paper. The amount that the pupil copies in and scores 
his own paper above the correct score is used as an indication of 
cheating. 

Situation B: Cheating on a test by adding on. A speeded arithmetic 
test is given, and at the end of 2 or 3 minutes pupils are told to stop 
work. However, for several minutes papers are left on their desks 
while the teacher or test administrator talks about something else. 
Later a second test is given and the papers are immediately col- 
lected. When performance on the first testing surpasses performance 
on the second test by a specified amount, this is taken as evidence 
that the examinee added onto his work after the time limit was up 
and before the papers were collected. 

Situation C: Cheating in a game—peeking. The game is illustrated 
in Fig. 12.1. The stunt is to shut one's eyes and put a dot in each 
circle in turn. Norms are prepared, based upon children tested with 
their view blocked so that they cannot peek. A child who performs 
unduly well, as determined by the "peek-proof" norms, is assumed to 
have peeked and helped himself. 

Fig. 12.1. Aiming test. (After Hartshorne & May.) 




Situation D: Cheating in an athletic contest. As a part of a "field 
day," each child is given a hand dynamometer to squeeze as a test 
of strength of hand. Three "practice" squeezes are given, and the 
adult observer notes and later records the best performance on these. 
Then the pupil is told to make additional squeezes "for the record." 
While he makes the squeezes, the adult is obviously busy with an- 
other child and not watching him. The child records his own per- 
formance on a record blank. Since fatigue tends to set in on suc- 
cessive squeezes, it is unlikely that he will show improvement. If 
the performance he reports surpasses his practice squeezes by a 
specified amount, it is assumed that he has been unduly optimistic in 
recording his performance. 

Situation E: Lying—self-glorification. In this test the child is 
asked a series of questions. Each question has to do with standards 
of behavior that are universally applauded but seldom achieved. 
Thus, one question reads "Do you always obey your parents 
cheerfully and promptly?" and another, "Do you always smile when 
things go wrong?" It is hard to know how many of a set of state- 
ments like this a child might truthfully endorse, but an attempt was 
made to determine this by having groups of graduate students think 
back to their childhood and respond as would have been true of them 
then. The child who marks an excessive number of items is deemed 
to be not angelic but untruthful. 

Situation F: Stealing. A game is devised which uses a number of 
coins. These are in a box, and one box is passed out to each child. 
After the game is over, each child is told to put the coins back in 
the box and fasten it up. The boxes are collected. They have been 
unobtrusively coded, so it is possible to tell which child had which 
box. A check of the coins in the boxes makes it possible to determine 
which children have helped themselves to one or more of the coins. 
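
The scoring rules described for Situations A, B, and D can be sketched in a few lines of Python. This is only an illustration, not the procedure May and Hartshorne actually used: the function names, the example scores, and the margins standing in for the "specified amount" are all hypothetical, and the sketch assumes that "surpasses by a specified amount" means a difference at least as large as the margin.

```python
def copying_score(self_scored: int, true_score: int) -> int:
    """Situation A: points the pupil credits himself above the score
    that was recorded, unknown to him, before the paper was returned."""
    return max(0, self_scored - true_score)


def exceeds_margin(suspect: float, baseline: float, margin: float) -> bool:
    """Situations B and D: flag a performance that surpasses an honest
    baseline by at least a specified amount (margin is hypothetical)."""
    return suspect - baseline >= margin


# Situation A: the pupil scores his own paper 34; the concealed scoring gave 29.
print(copying_score(34, 29))

# Situation B: 41 problems on the unguarded first test, 30 on the guarded retest.
print(exceeds_margin(41, 30, margin=5))

# Situation D: best observed practice squeeze 22, self-recorded squeeze 23.
print(exceeds_margin(23, 22, margin=3))
```

In each case the score depends on a concealed honest baseline, which is why the security of the test matters so much.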


As can be seen from the brief descriptions, the tests are quite in- 
volved and require rather extensive stage-managing. The details 
of the testing situation seem fairly critical, i.e., how sure the child 
feels that he is free from observation, the manner in which the 
children are occupied when they are stopped in their work, and so 
forth. And it is crucial that the “security” of the test be maintained, 
for if the true purpose of the test were suspected, examinees could 
immediately conform to the approved social standard. 


EVALUATION OF BEHAVIOR TESTS OF HONESTY 


How reliable and how valid are these situational tests of honesty? 
Reliability estimates are shown in Table 12.1. We can see that the 
reliabilities of single tests are rather modest, averaging about 0.50. 
In comparison with the aptitude and achievement tests we have 
been considering in the preceding chapters, these reliabilities are 
disappointing. The score of a pupil on any single test of the set used 
by May and Hartshorne would provide only the roughest indication 
of the typical behavior of that child. A single test would need to 
be extended by adding on several additional tests of the same sort 
if a satisfactorily stable and dependable measure were to be obtained. 
The tests would appear to be useful primarily for the comparison of 
different groups of pupils. 

Table 12.1. Reliabilities of Tests Used for Measuring Deception 
(From May & Hartshorne) 

                                                      Reliability 
Type of Test                                          Coefficient 
1. Copying from a key or answer sheet                     .70 
2. Adding onto one's score on a speeded test              .44 
3. Peeping when one's eyes should be shut                 .46 
4. Faking a solution to a puzzle                          .50 
5. Faking a score in a physical ability test              .46 
6. Lying to win approval                                  .84 
7. Getting illicit help at home                           .24 


When it comes to validity, we are put to it to find any outside 
standard against which to evaluate the tests. Teachers’ ratings of 
pupils may be taken as one limited and imperfect criterion, and the 
classroom cheating tests showed a modest correlation with this 
criterion (average about 0.35). But before we look for outside criterion 
measures, we should perhaps ask how the different kinds of honesty 
tests correlate with each other. 

Considering four different types of cheating tests carried out in 
the classroom situation, the authors found that on the average a 
test of one type correlated with a test of one of the other types only 
to the extent of 0.26. When some type of classroom cheating test 
was correlated with cheating in an athletic contest, the average 
correlation was found to be only 0.16, and with the stealing test the 
average correlation was 0.17. The lying test, also given in the 
classroom, averaged 0.23 with the other classroom tests and 0.06 with 
the two out-of-classroom tests. 



Even though the reliabilities of the single tests are low, the 
correlations between the different sorts of tests are a good deal lower. 
When the correlations involve different settings (i.e., classroom versus 
gymnasium) or different types of behavior (i.e., cheating versus 
stealing), the correlations drop still further. Many of them are not 
far from zero. Whatever these tests are measuring, they are not all 
measuring the same thing. 
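
One way to see that unreliability alone does not explain these low figures is the classical correction for attenuation. The passage does not use that correction, so it is introduced here as an assumption. The sketch divides an observed correlation by the geometric mean of the two reliabilities, and it treats both tests as having the rough average reliability of 0.50 quoted earlier.

```python
import math


def corrected_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between two measures if both were perfectly
    reliable (classical test theory; an assumption of this sketch)."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Two different classroom cheating tests: observed average r = 0.26.
print(round(corrected_for_attenuation(0.26, 0.50, 0.50), 2))

# Classroom cheating versus cheating in an athletic contest: r = 0.16.
print(round(corrected_for_attenuation(0.16, 0.50, 0.50), 2))
```

Even on these assumptions the corrected values remain far below 1.00, which agrees with the conclusion that the tests are not all measuring the same thing.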

These correlations tell us something about the trait of honesty, 
something that may apply to other character traits as well. Honesty 
does not appear as a unitary characteristic of the individual. There 
is no guarantee that a person who is honest in one situation will be 
so in another. The low correlations show us that behavior in one 
situation tells us very little about behavior in another situation 
somewhat different from the first. Honesty is situationally deter- 
mined and socially specified. For the individual child there is not 
a single honesty; there are honesties. And this is true of adults. 
Not peeking in a card game, paying one's railroad fare even when 
the conductor forgets to ask for it, reporting every penny of income 
on the income tax return, not distorting the facts as to the size or 
number of fish one caught—all these represent different behaviors 
only loosely related to one another. 

The low correlations among the honesty tests tell us something 
about the organization of behavior in the character domain. It 
appears to be only rather loosely organized. They also pose for us a 
problem as far as measuring the trait is concerned. Clearly, no one 
test can provide an adequate appraisal of honesty-in-general. It tests 
only honesty in one specific setting. It cannot provide an adequate 
description of an individual. We would require a battery of tests 
such as May and Hartshorne had. But even in this case, we might 
question the meaningfulness of a single honesty score, since it would 
be made up of so many loosely related specifics. A single over-all 
score on an achievement test battery would provide a very blurred 
picture of the nature of a child's school achievement. Even more so, 
a single honesty score would blur many variations in specific 
behavior patterns. 

The low correlations among specific honesty tests make it neces- 
sary to include a number of separate tests if we hope to get an ade- 
quate representation of different honesties. Because of this fact, to- 
gether with the complexities of testing procedure, the use of behavior 
tests of character has been limited largely to research projects. 
They have not been adapted to any extent for routine use in schools 
or for any type of personnel selection. 


SELECTED RESULTS FROM MAY-HARTSHORNE STUDIES 


As research tools, behavior tests have provided a wealth of 
interesting data, notably in the original studies of May and Hartshorne. 
Some of the more interesting findings from these studies are 
summarized below. Readers are referred to the original studies for 
details. 


1. Honesty was essentially unrelated to age or sex over the range 
of grades studied. There was no tendency for children to learn to be 
more honest as they got older. 

2. The more intelligent children received higher honesty scores. 
Of course, school pressures were probably less severe for brighter 
children. How much the difference in behavior reflects a difference 
in motivational pressures cannot be determined. 

3. Honesty was associated with socio-economic status, children 
from higher socio-economic levels evidencing less dishonesty than 
those from lower levels. 

4. Siblings resembled one another in honesty, and this resemblance 
was more than could be accounted for by familial resemblances in 
intelligence or by the common socio-economic background. 

5. Children in a school following progressive educational practices 
cheated less than comparable children in a conventional school 
program. 

6. The children within a school as a whole or a class group within 
a school tended to resemble one another in level of honesty 
displayed. There appeared to be a factor of school or class morale. 

7. There was no indication that children who participated in 
organized programs of religious education or who were members of 
groups expressing character education aims were more honest than 
non-participants or non-members. 


SITUATIONAL TESTS AND ASSESSMENT PROGRAMS 


During and since World War II a number of assessment programs 
have been set up for making a comprehensive appraisal of 
candidates for a particular type of training or assignment. Perhaps the 
most publicized of these was the program set up to screen personnel 
for the Office of Strategic Services during World War II. The 
program has been fully described, and some features of it will be worth 
considering here. 

Assessment programs have generally made use of 
a wide variety of techniques for evaluating the individual. They 
have included ability tests of several sorts, detailed interviews, and 
various types of fantasy and projective materials. However, one 
central element has been the situational test, in which the individual 
is placed in a more or less standardized task situation where his 
behavior can be observed, his responses recorded, or various aspects 
of his reactions rated by observers. We shall first describe some of 
the situational tests used by the OSS and then comment more 
briefly on another assessment program developed as a procedure 
for selecting clinical psychologists. 


SITUATIONAL TESTS IN THE OSS ASSESSMENT PROGRAM 


For assessment by the OSS staff, each candidate was brought to 
an assessment center, the chief one a large estate near Washington, 
D. C., for a 3-day period of testing and evaluation. During this 
period he was continuously under observation and was subjected to 
a wide range of tests and stresses. In addition to ability tests of a 
number of kinds (tests of intelligence, mechanical ability, ability to 
observe and remember details), he was exposed to a number of 
"situational" tests. These consisted of staged situations, with 
fairly complete instructions and ground rules, presenting problems 
that the candidate was to solve, either individually or as a member 
of a group. The variety of situational tests used in the program was 
wide. Selected examples are described briefly in the following 
paragraphs.* 

The Brook. Individuals worked in teams composed of five or six 
men. The group was brought to a stream about 8 feet wide. On 
the banks were a log, a rock, various boards, ropes, a pulley, and 
other items. They were instructed somewhat as follows: 


In this problem you have to use your imagination. Before you you see a 
raging torrent so deep and so fast that it is quite impossible to rest any- 
thing upon the bottom of the stream. The banks are sheer, so it will be 
impossible for you to work except from the top of them. 

You are on a mission in the field, and having come to this brook you 
are faced with the task of transporting this delicate range-finder, 
skillfully camouflaged as a log, to the far bank, and of bringing that box of 
percussion caps, camouflaged as a rock, to this side. In carrying out this 
assignment, you may make use of any materials you find around here. 
When the job is done, all of you, as well as any material you have used, 
are to be back on this side. 

The limits within which you are to work are marked by the two white 
stakes on each bank (the stakes were approximately 15 feet apart), and 
you are not permitted to jump across the stream. 



The behavior of each man in the group was observed, as the group 
went about solving the task, and each man was rated on such factors 
as energy and initiative, effective intelligence, social relations, 
leadership, and physical ability. 

* For fuller descriptions and verbatim instructions, the reader is referred to 
The Assessment of Men. 

Construction. A single candidate was presented with the task of 
building a structure, using materials resembling oversized Tinker 
Toy. His task was described to him, and then he was told that the 
test was primarily one of leadership, since the work was to be done 
by two helpers whom he was to supervise. The "helpers" were called 
in, and the construction project began. However, the "helpers," 
who were carefully coached assistants, turned out to be sensitive, 
stupid, and obstructive, and their behavior ranged from 
"gold-bricking" to systematically heckling their supervisor. Again, the 
responses of the examinee to these frustrations were observed, and 
he was rated for emotional stability and leadership. 

Improvisations. This test was one of role-playing. Working in 
pairs, examinees were assigned roles in a dramatic situation and were 
told to enact the scene as they would handle it in real life. Thus, 
one situation was set as follows: 


A moved to a small city about 3 months ago and opened a business 
there. He has been doing quite well and one month ago sent in an appli- 
cation for membership in a club in the town. He has heard nothing in 
response to this application and goes to the home of B, a prominent mem- 
ber of the club, with whom he is pleasantly acquainted. (A is then sent 
out of the room and B is told that A has received several blackballs. A 
is then called back into the room.) 

As before, aspects of each individual's personality, as they exhibited 
themselves in the role-playing situations, were rated by observers. 


Stress Interview. Candidates were instructed to assume that the 
following situation had occurred. 

A night watchman at 9:00 P.M. found you going through some papers 
in a file marked "SECRET" in a Government office in Washington. 
You are NOT an employee of the agency occupying the building in which 
the office is located. You had no identification papers whatsoever with 
you. The night watchman has brought you here for questioning. 


The examinee was given 12 minutes to prepare a cover story to ac- 
count for his presence in the compromising situation. Then he was 
subjected to an intensive and grueling interrogation, in which his 
statements were questioned, inconsistencies brought out, and every 
attempt made to trip him up and to make him feel foolish. He was 
rated on the quality of his story and his ability to maintain it and 
upon his evidence of emotional stability. 


Further examples of situational tests might be cited, but these 
serve to show the essential characteristics of this type of approach 
to personality appraisal. The attempt is made to develop situations 
that approach realistic lifelike situations but still permit a reasonable 
amount of control from person to person. The OSS staff considered 
desirable characteristics of situations to be that they (1) have a 
number of alternative solutions, (2) do not require highly specialized 
abilities, (3) reveal kinds of behavior that cannot be registered by 
mechanical means, (4) force the candidate to reveal dominant dis- 
positions of his personality, (5) involve interaction with other per- 
sons, and (6) require the coordination of numerous components of 
personality. Situations can be planned to elicit the types of be- 
havior in which the experimenter is interested, but for evaluation of 
the behavior he is largely thrown back upon observations and 
ratings. 


EVALUATION OF OSS SITUATIONAL TESTS 


Situational tests of the sort used in the OSS differ from the 
May-Hartshorne character tests in that, though they still deal with 
behavior in a somewhat disguised situation, they do not yield an actual 
record or product. Thus, in the May-Hartshorne stealing test it 
was necessary only to count the coins left in the box to determine 
the examinee's score. The tests were highly objective as far as the 
scoring was concerned. Tests of the OSS type are not objective. 
Though an attempt is made to present a standard task situation, 
the evaluation of each examinee's behavior is through the 
observations and ratings of the staff of examiners. 

The gain from this approach, which offsets the loss in objectivity, 
is a great increase in the range of behaviors that can be studied. 
Much that the individual does, especially in his relations with others, 
leaves no record once the behavior is past. An action showing 
aggression or resistance to domination, an integrating suggestion that 
promotes group harmony, assumption of the initiative, or lapsing 
into passive followership are actions that occur and are gone. We 
must observe them on the wing if we are to get them at all. This is 
what the situational test hopes to achieve—to provide the situa- 
tions that will elicit behavior of this sort and to provide for its 
immediate observation and rating. 


Situational tests appear to be adaptable to eliciting a variety of 
types of social and emotional behavior that have resisted measure- 
ment by any more objective form of test. However, they present a 
number of problems. A program involving a number of situational 
tests is costly. The tests are likely to be costly in the facilities and 
arrangements they require. They are almost certain to be costly in 
the time of professional personnel to supervise their administration 
and to evaluate the behavior exhibited in the test situation. The 
staging of the situations may call for a certain amount of dramatic 
skill on the part of the examiners, and there is a real problem in 
maintaining the uniformity of the situations from individual to in- 
dividual and from group to group. Another problem is that of pre- 
venting leakage of information about the test tasks, so that the task 
is a novel one to each group as it is tested and is approached by each 
group with the same background. In view of the practical difficulties 
involved, it is not surprising that the use of situational tests has been 
limited to rather elaborate assessment programs, arranged for evalua- 
tion of special types of personnel—undercover agents, clinical psy- 
chology trainees, or business executives. 

The actual value of situational tests and, in fact, of the whole 
elaborate assessment program remains somewhat of a question. Psy- 
chologists who have participated in the programs are, in many cases, 
enthusiastic. Whether the information that is elicited has value in 
predicting important facts about the individual is another matter. 
In the OSS program, it was possible to obtain only a limited amount 
of evidence on the extent to which men who had gone through the 
assessment program turned out well in their job assignments. Rat- 
ings from overseas colleagues and evaluation by commanding officers 
were obtained in a fraction of the cases. Predictions of success did 
correlate significantly with success on the job. The evaluation that 
showed the highest correlation was rating for effective intelligence. 
The final rating for effective intelligence based on the complete 3-day 
program had a somewhat higher correlation with rated success on 
the job than did scores based on a brief objective test of verbal 
ability, but the difference was not great. 

The question of the reliability of the appraisals based on the 
situational tests did not receive systematic study in the OSS 
assessment program. The program was an operational enterprise, not a 
research project, and time was not available for making analytical 
studies of the elements that were incorporated in the total program. 
Judgments in any one setting were more or less contaminated by 
impressions gained in previous tests. Evaluations by different 
observers were not systematically kept separate for study. 


THE ROLE OF SITUATIONAL TESTS IN THE ASSESSMENT OF CLINICAL 
PSYCHOLOGISTS 

Another assessment program, organized for the purpose of 
selecting trainees for a Veterans Administration clinical psychologist 
training program, also included situational tests. Among these tests 
were: 


1. Improvisations (see p. 307). 

2. Discussion situation test, in which individuals formulated sepa- 
rate plans and then participated in a group discussion to arrive at 
a common plan for dealing with a social problem. 

3. Block situation test, involving a group problem-solving task in 
which a set of heavy blocks had to be moved and organized accord- 
ing to an initially unknown set of principles. 

4. Expressive movement situation test, in which the examinee had 
to express the feeling of a poem and of a series of discrete words, 
using his body but not speaking. 


In each situation, the candidate was observed by one or more ob- 
servers, and his behavior was rated with respect to his predicted 
competence in different aspects of the work of the clinical psycholo- 
gist. 

Criterion ratings were obtained from university and field-work 
supervisors some 3 years after the original assessment. The pooled 
ratings of probable success based on a series of situational tests 
correlated 0.24 with a final rating for clinical competence and 0.26 with 
a rating of preference for hiring. A standard intelligence test gave 
correlations of 0.35 and 0.20 with these two criterion ratings. The 
ratings based on the situational tests were about as good predictors 
as the intelligence test alone. In this investigation, the predicted 
success of the candidates was appraised after each additional item 
of information was added to the pool. Thus, the initial prediction 
was based only on the information in the individual's credential file. 
A second prediction was based upon this information plus the record 
of objective ability tests, and further predictions were made as 
additional elements of information were added on. The validities of the 
predictions at different stages were as follows: 


                                                  Later Rating On
                                               Clinical     Preference
                                              Competence    for Hiring
Individual rating based on credential file
  only                                           0.31          0.17
Data from objective tests added on               0.37          0.28
Personal autobiography also available            0.40          0.33
Analyses of projective personality tests
  (see Chapter 15) also included                 0.40          0.34
Intensive personal interview carried out         0.42          0.37
Records from situational tests included          0.39          0.34




The coefficients shown above do not permit one to assess the valid- 
ity of each procedure taken separately, but they do show what each 
one adds to the pool of information already available. In this case, 
the situational tests appeared to add nothing to the validity of the 
prediction that could be made from the other items. As a matter of 
fact, given the individual's credential file, including his previous 
academic record, recommendations, etc., his objective test scores, 
and the autobiographical statement he had written, the prediction 
was about as good as it was with the further addition of all the more 
elaborate interview, projective test, and situational test data. There 
is little evidence that these time-consuming procedures added
information of value in this instance.
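
The stage-by-stage comparison can be made explicit by differencing successive coefficients. The following is a minimal sketch, not part of the original study: the coefficients are copied from the table above, while the function name `validity_gains` and the stage labels are illustrative assumptions.

```python
# Validity coefficients copied from the table above, in the order in which
# information was added: (label, clinical competence, preference for hiring).
stages = [
    ("Credential file only", 0.31, 0.17),
    ("Objective tests added", 0.37, 0.28),
    ("Autobiography added", 0.40, 0.33),
    ("Projective tests added", 0.40, 0.34),
    ("Interview added", 0.42, 0.37),
    ("Situational tests added", 0.39, 0.34),
]


def validity_gains(rows):
    """Return the change in each coefficient relative to the preceding stage."""
    gains = []
    for (_, prev_comp, prev_hire), (label, comp, hire) in zip(rows, rows[1:]):
        gains.append((label, round(comp - prev_comp, 2), round(hire - prev_hire, 2)))
    return gains


for label, d_comp, d_hire in validity_gains(stages):
    print(f"{label:<26}{d_comp:+.2f}  {d_hire:+.2f}")
```

Read this way, the last row shows a small drop rather than a gain when the situational-test records are added, which is the point the text makes in words.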


EVALUATION OF SITUATIONAL TESTS 


In summary, then, the situational type of test is an interesting 
additional tool for personality assessment. It seems to provide a 
direct opportunity to see the individual functioning in lifelike situa- 
tions and thus to appraise a variety of aspects of leadership, coopera- 
tion, and social functioning. However, evidence for the value of 
the results as improving our prediction of the individual's success 
on the job is largely lacking. Because its practical value has not 
been demonstrated and because the techniques are costly in prepa- 
rations required and in the time of testing personnel, situational 
testing must be considered a subject for research at the present time,
rather than a proven tool for personnel evaluation.


SYSTEMATIC OBSERVATION 


The situational test has introduced us to observation as a tech- 
nique for studying the typical behavior of the individual. Observa- 
tion in that instance was of what he did in specified test situations. 
We turn now to observation in the naturally occurring situations of 
everyday life. The situations of everyday life are probably less uni- 
form from person to person than the test situations that we stage. 


Also, they are not loaded to bring forth the behaviors in which we
are specially interested. However, the very naturalness of real life
events and the fact that we do not have to stage special events just
for testing purposes make observation of natural situations appealing
to us.

Of course, we observe the people with whom we associate every 
day of our lives, noticing what they do and reacting to the ways in 


which they behave. Our impressions of people are continuously being
formed and modified by our observations of them. But these
observations are casual, unsystematic, and undirected. If we are asked
to document with specific instances our judgment that John is a
leader in his group or that Henry is undependable, we will usually
be put to it to provide more than one or two concrete observations
of actual behavior to document our general impression. Observations
must be organized, directed, and systematic if they are to yield
dependable information about an individual.

Systematic observational procedures have been most fully developed
in connection with studies of young children. They seem
particularly appropriate in this setting. On the one hand, the young
child has not developed the covers and camouflages to conceal himself
from public view as completely as has his older brother or sister,
so there is more to be found out by watching him. On the other
hand, he is less able to tell us about himself in words. So it has been
in the study of nursery-school children that observational procedures
have had their fullest development.


STEPS TO IMPROVE OBSERVATIONAL PROCEDURES 


Many of the early studies of young children were accounts of the 
development of a particular child or of two or three children based 
on observations by a psychologist parent. These provided a general 
descriptive background for understanding the young child, but they 
were qualitative and lacking in precision. Careful research with the 
child or investigations to determine the effect of particular preschool 
environments or experiences require that we know not merely that 
he shows negativism and resistance, for example, but also how much
or how often. The needs of measurement, as distinct from those of
qualitative description, require observational procedures that will
permit a statement of quantity, of amount. The procedures should
be as objective and reliable as possible, with a minimum of dependence
upon the whims and idiosyncrasies of the individual observer.
To accomplish this, several precautions were typically undertaken.
These are discussed below.


1. Selecting the Aspect of Behavior to Be Observed. One problem
of the general observer of human behavior is that he does not know
what he is looking for. So much is happening in any situation
involving one or more active human beings that some part of it must
inevitably be missed. We cannot notice everything that happens, and
we cannot record everything that we notice. In any program
of systematic observation, we must first select certain aspects or
categories of behavior to be observed. Thus, in a study of
nursery-school pupils, we may be interested in aggressive behavior and
may limit ourselves to instances of aggressiveness. In a research
project to evaluate a school program, we may be interested in observing
evidences of cooperation or of independently initiated activity and
may restrict our observation to these.

2. Defining the Behaviors That Fall within a Category. If we turn two
observers loose without further ado to observe the occurrence of
"aggressive acts" or "nervous behavior" in preschool children, we
will find that there are many disagreements between them in the
observations they make. Our categories must be further specified.
They must become more behavioral if we are to get good agreement
between observers. What is an "aggressive act," a "nervous habit"?
Do we wish to include name-calling in the first instance? Fidgeting
in one's seat in the second? Just as we must analyze "ability to
get and interpret data" into specific testable skills of using an index
or making inferences from a bar chart, so must we translate
"aggressive acts" into hitting, kicking, biting, pushing, grabbing,
name-calling, and the like. An advance agreement on what is to be
included, based upon prior studies of the domain in question, is a
necessary condition of objective and reliable observation.

3. Training Observers. Even with a carefully defined set of behaviors
to be observed, disagreements arise between observers.
Some of these are unavoidable due to fluctuations of attention or
variation of scoring on close judgments. Others can, however, be
eliminated by training. Practice sessions in which two or more
observers make records of the same sample of behavior, compare notes,
discuss discrepancies, and reconcile differences provide one means of
increasing uniformity. Practice sessions watched by and criticized
by an already trained observer represent another. Such procedures
make for uniformity of interpretation and standard application of
the observation categories.

4. Quantifying Observations. If observations of some aspect of the
child's behavior, his aggressive acts or his social contacts, for example,
are to provide a measurement of the child, some form of quantification
is required. The quantification usually takes the form of counting.
The count may be of the number of times that a child shows a
particular form of behavior during a period of observation. However,
in this case one often has difficulty in deciding when one act
ends and the next one begins. Johnny slaps Henry and then kicks
him. Is this one aggressive act or two? If the actions flow over from
one to the other, the decision may not be an easy one.




An expedient that has appeared to work well in a number of cases
has been to break the period of observation up into quite short
segments. These may be no more than a minute or even half a minute
in length. Then the observation that is made is merely the occurrence
or non-occurrence of the particular category of behavior during
each small segment of time. Thus, we might observe each child for
ten 5-minute periods, each on a different day. The 5-minute periods
might each be subdivided into ten half-minute periods. For each of
the half-minute periods we would observe whether the particular
child did or did not exhibit any of a set of defined aggressive
behaviors. Each child would then receive a score, with a possible range
from 0 to 100, indicating the extent of his overt aggressiveness. Such
scores, based on an adequate number of short samples of observed
behavior, have been found to show quite satisfactory reliability.
Thus, Olson found the reliability for twenty 5-minute observations
of children's nervous habits to be 0.87 in one case and 0.82 in
another.
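
The interval-scoring scheme just described can be sketched in a few lines. This is an illustration only: the function name `time_sample_score` and the sample record are hypothetical, and each half-minute interval is assumed to be stored simply as True (behavior occurred at least once) or False.

```python
def time_sample_score(sessions):
    """Count the short intervals in which the target behavior occurred.

    `sessions` is a list of observation periods; each period is a list of
    booleans, one per half-minute interval (True = behavior observed).
    Several acts within one interval still count only once.
    """
    return sum(1 for period in sessions for occurred in period if occurred)


# Hypothetical record for one child: ten 5-minute periods, each split into
# ten half-minute intervals, giving 100 intervals in all.
sessions = [[False] * 10 for _ in range(10)]
sessions[0][2] = True   # first day, third interval
sessions[0][3] = True   # first day, fourth interval
sessions[4][9] = True   # fifth day, last interval

score = time_sample_score(sessions)
print(score)  # 3
```

Because only occurrence or non-occurrence is recorded, the observer never has to decide where one act ends and the next begins; the price is that a burst of several acts inside one interval is scored the same as a single act.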

5. Developing Procedures to Facilitate Recording. An essential for
accurate observational data is some procedure for immediate recording
of what was observed. The errors and selectivity of memory enter
in to bias the reporting of even outstanding and unusual events. In
the case of the rather ordinary and highly repetitive events that are
observed in watching a child in preschool, for example, an adequate
account of what was observed is only possible if the observations are
recorded immediately. There is so much to see and one event is so
much like others that to rely upon memory to provide an accurate
after-the-fact account of a child's behavior is fatal. This is certainly
the case in any attempt at complete and systematic recording, though
we shall find a place for selective observation and anecdotal recording
of significant incidents of behavior some time after they have
taken place.

Any program of systematic observation must, therefore, provide
some technique for immediate and efficient recording of the events
that are observed. There are many possibilities for facilitating
recording of behavior observations. One that has been widely used
has been to develop a systematic code for the categories of behavior
that are of interest. Thus, preliminary observations will have served
to define the range of aggressive acts that can be expected from 3-
and 4-year-olds. Part of the code might be set up as follows: h =
"hits," p = "pushes," g = "grabs materials away from," n = "calls
a nasty name," and so forth. A record blank can be prepared,
divided up to represent the time segments of the observations, and
code entries can be made quickly while the child is observed almost
without interruption.
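
A coded record blank of this kind is easy to tally afterward. The sketch below is hypothetical throughout: the code letters follow the example in the text, but the record itself, the names `CODE`, `tally`, and `segments_with_any`, and the choice to store each time segment as a short string of code letters are all assumptions made for illustration.

```python
from collections import Counter

# Assumed code letters, modeled on the example in the text.
CODE = {
    "h": "hits",
    "p": "pushes",
    "g": "grabs materials away from",
    "n": "calls a nasty name",
}


def tally(record):
    """Count how often each coded behavior was entered on the record blank.

    `record` holds one string per time segment; each string contains the
    code letters entered during that segment ("" if nothing was seen).
    """
    counts = Counter()
    for segment in record:
        for letter in segment:
            counts[CODE[letter]] += 1
    return counts


def segments_with_any(record):
    """Number of time segments in which at least one coded act was entered."""
    return sum(1 for segment in record if segment)


# Hypothetical 5-minute observation, ten half-minute segments.
record = ["", "h", "", "pg", "", "", "n", "", "", "h"]
print(tally(record))
print(segments_with_any(record))  # 4
```

An unknown letter raises a `KeyError` here, which is deliberate: an entry outside the agreed code is a recording error that should be noticed, not silently counted.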

If the observer is skilled in standard shorthand, of course, fuller 
notes of the observation can be taken. These can be transcribed 
and coded or scored later. In some cases, where a research project 
has liberal financial backing, more complete photographic or sound- 
tape recordings of the observations may make possible a permanent 
record of the behaviors in a relatively complete form. These records 
can then be analyzed at a later date. Such resources are likely to be 
the exception, however, and in many cases it will be necessary to 
plan a simple and efficient code to provide an immediate and perma- 
nent record of what was observed. The important objectives here 
are to do away with dependence on memory, to get a record that 
will preserve as much as possible of the significant detail in the origi- 
nal behavior, and to develop a recording procedure that will inter- 
fere as little as possible with the process of observing the child. 


ILLUSTRATIVE STUDIES USING DIRECT 
OBSERVATION 


The ways in which direct observation has been used in studying 
aspects of the child's behavior and the impact of educational experi- 
ences upon him can best be indicated by selected illustrations. We 
have chosen three examples of quite different types of observational
procedures and quite different problems to illustrate the applications
of direct observation.


NURSERY-SCHOOL CHILDREN'S SOCIAL BEHAVIOR 

First let us describe a study of the social behavior of nursery-school
children and the impact of nursery-school experience on that
social behavior. This study deals with a group of 18 children in
one nursery school. Nine of the children were "veterans" who had
attended nursery school before; the other nine were "novices" who
were at school for the first time. The procedure was one of direct
observation, in which a running diary account was kept of a child's
activities on the playground. One child was observed at a time.
In the fall, the observation period was 15 minutes and there were
ten such periods. However, the record was kept by separate
half-minutes and could be analyzed by half-minute units. An additional
series of eight 5-minute observations was carried out with each child
in the spring.




The behavior that was particularly observed was "social contacts." 
A social contact was taken to mean "any occasion which, to all
appearances, involves an actual interchange between two or more 
children; this includes all cooperative organized play, sharing of ma- 
terials or activity, physical contacts, or conversation." The records 
were scored to determine the per cent of half-minute periods during 
which a social contact occurred. Observations were made inde- 
pendently by two observers, so it was possible to determine observer 
consistency as well as the consistency of the child from day to day. 

We shall report briefly some of the quantitative data that are of
interest in characterizing the observational method as used here. 
The reader who is interested in details of the substantive results and 
particularly in qualitative materials and interpretations is referred 
to the original study. 

The two observers agreed in their report of an item of behavior 
in approximately 95 per cent of the instances. That is, there were 
about 5 per cent of the items that were reported by one observer but 
not by the other. The reliability of the score for a child based upon 
ten 15-minute periods of observation was estimated (by split-half 
procedures) to be about 0.80. In the fall, the “novices” showed social 
contacts in only 26 per cent of the periods, on the average, whereas 
the "veterans" showed contacts in 47 per cent. By spring the average
for both groups was 58 per cent. Though previous nursery-school
experience gave the "veteran" group a large early advantage,
this was apparently completely overcome by the "novices" by the
end of the school year. However, the correlation of fall score with
spring score was 0.60 for the "veterans" and 0.82 for the "novices."
Individual rank in the group showed marked stability over this
period of time.
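
The observer-agreement figure quoted for this study counts an item as an agreement only when both observers reported it. A minimal sketch of that calculation follows; the function name `percent_agreement` and the sample items are hypothetical, and each item is assumed to be identified by its half-minute interval and behavior category.

```python
def percent_agreement(items_a, items_b):
    """Per cent of reported items that both observers recorded.

    An item reported by only one observer counts as a disagreement, so
    the base is every item reported by either observer.
    """
    a, b = set(items_a), set(items_b)
    reported = a | b
    if not reported:
        return 100.0  # nothing reported by either observer
    return 100.0 * len(a & b) / len(reported)


# Hypothetical records: (half-minute interval, category) pairs.
observer_a = [(1, "contact"), (2, "contact"), (5, "contact"), (7, "contact")]
observer_b = [(1, "contact"), (2, "contact"), (5, "contact"), (8, "contact")]

print(percent_agreement(observer_a, observer_b))  # 60.0
```

In the example, three items are shared and two are reported by one observer only, so agreement is 3 out of 5. Other definitions of agreement exist; this one is chosen because it matches the wording "reported by one observer but not by the other."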


AN EVALUATION OF A SCHOOL "ACTIVITY PROGRAM"


Our next illustration of the use of direct observation deals with
an attempt to evaluate certain changes in curriculum and procedure
in the New York City elementary schools. In 1935 the Board of
Education of New York City introduced into a number of schools
an experimental "activity program" in which units of pupil activity
were to supplement, and in a measure replace, the traditional
textbook learnings. As a part of the total experiment, a project was
undertaken to evaluate on a broad base the effects of this program,
as compared with the standard program that had been in effect and
was continued in many "control" schools. In addition to
paper-and-pencil tests of abilities and skills, both conventional
standardized tests and some of a more experimental character, classroom
observations were carried out. The hope was that the observations would
get at some of the differences between the schools that were not well
represented in paper-and-pencil tests.

The categories of behavior that were observed were: 


1. Cooperative Activities: helping pupils or teacher; offering objects to
teacher, pupil or visitor; responding quickly to requests for materials,
quiet, etc.

2. Critical Activities: praising or challenging work of others; defending
own point of view; asking pertinent questions of teacher or other pupils.

3. Experimental Activities: trying out new things, putting things together
in new combinations; creating or constructing an original poem,
story, diagram, instrument, etc.

4. Leadership Activities: organizing, directing, or controlling persons.

5. Recitational Activities: responding to question on assigned text;
volunteering answer on assigned text or subject matter.

6. Self-Initiated Activities: bringing voluntary contributions to class
activities; submitting data gathered outside school; presenting a report on
a self-directed investigation; suggesting methods, materials, etc. for
developing a project.

7. Negative Work-Spirit Activities: wasting time; requiring undue help;
not working when teacher is absent; leaving paper on floor; etc.


Behavior was observed during a series of half-hour periods by an
observer seated unobtrusively in one corner of the room. The
observer undertook to observe the whole class at the same time, coding
behaviors as they occurred. Under these circumstances, some
items of behavior were certainly missed. However, analysis of a
number of periods of joint observation by two independent observers
showed that there was substantial agreement in the items that
were recorded. The percentages of agreement were reported as
follows:

                                     Per Cent
Recitational activities                 90
Cooperative activities                  82
Self-initiated activities               87.5
Critical activities                     82.5
Leadership activities                  100
Experimental activities                 96
Negative work-spirit activities         61


A further analysis of reliability was made by correlating frequency
of a particular item of activity for successive samples of
observation (first 5 versus second 5 of 10 periods). Correlations were
computed for class averages of 51 classes and for individual scores of
1833 pupils. The estimates of the reliability of a score based on a
complete series of ten 30-minute periods of observation were as
shown.


                              Class Total    Individual Pupil
Cooperative activities           0.88             0.56
Critical activities              0.82             0.54
Experimental activities          0.44             0.39
Leadership activities            0.90             0.32
Recitational activities          0.68             0.46
Self-initiated activities        0.60             0.47
Work-spirit activities           0.94             0.54


Thus, although in most cases this procedure gave scores having
fairly satisfactory reliability for studying the group as a whole, the
procedure was quite undependable when it came to providing a score
for a specific individual. Fortunately, in the context in which the
results were used it was groups rather than individuals that were
the matter of primary concern.

The reader may be interested in a brief summary of the results
from the observational procedures as far as differences between the
two educational programs were concerned. The "activity" classes
showed consistently and dependably more critical, experimental,
leadership, and self-initiated activities. The "control" classes
showed consistently and dependably more recitational behavior.
Differences in cooperative and negative work-spirit activities were
small and inconsistent from one semester to the next. Readers
interested in other aspects of similarity or difference between the two
groups are referred to the original report of the investigation.
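
The reliability estimates reported above for the activity-program observations rest on correlating the first five periods with the second five and then estimating the reliability of the full ten-period score. One standard way of making that step is the Spearman-Brown formula. The original report's exact procedure is not given here, so its use in this sketch is an assumption, and the half-series correlation in the example is a made-up value.

```python
def spearman_brown(r_half, factor=2.0):
    """Estimate the reliability of a score based on `factor` times as many
    observations, given the correlation `r_half` between comparable parts.

    With factor=2 this is the usual split-half correction:
    r_full = 2 * r_half / (1 + r_half).
    """
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


# Hypothetical correlation between first-half and second-half scores.
r_half = 0.50
print(round(spearman_brown(r_half), 2))  # 0.67
```

The same function shows why pooling more periods helps: each doubling of the amount of observation raises the estimated reliability, though with diminishing returns as the coefficient approaches 1.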


BEHAVIORAL EFFECTS OF A UNIT ON COMMUNICABLE DISEASES


The third and final illustration shows how direct observation may
be used to aid in evaluating the outcomes from a specific unit of
instruction. The unit was a 6 weeks' one on communicable diseases
and was taught in a high-school course in biology. The purposes
of the unit were not merely to provide the pupils with certain items
of factual information and certain generalizations and understandings,
but also to change their overt behavior in ways approved by
health officers. Certain frequently occurring behaviors were selected
for observation. Undesirable behaviors included putting fingers in
the mouth, putting other objects in the mouth, biting fingernails,
putting fingers in the nostrils, rubbing an eye with a finger. A
desirable behavior that was observed was using one's handkerchief
when one sneezed or coughed.




Observations were carried out on all the pupils in a class group 
at one time. Each period of observation was 20 minutes in length, 
and there were ten periods in a series. One series of observations was 
carried out prior to teaching the unit; one, immediately after teach- 
ing the unit; and one, after a lapse of 12 weeks. Concurrent ob- 
servations were made on control groups who were taught a unit 
dealing with quite different content. Observations were carried out 
both in the biology class, in which pupils had received the actual 
instruction, and in English class, to check on the generality of any 
changes produced. As in the previous studies, some observations 
were carried out with a check observer to test the reliability of the 
procedure, and very high consistency was indicated. 

Results are summarized in Table 12.2. The fact that teaching had
an impact on pupils' behavior is brought out dramatically by this
table. In the experimental group a number of disapproved behaviors
all but disappeared and the esteemed actions came strongly into the
picture. The changes carried over in large measure at least to
another classroom setting, and persisted with little change 3 months
after the unit was over. No comparable effects appeared in the
control group. Evidence of this sort is an important supplement to
tests of the conventional types, if we wish to get a well-rounded
picture of the effects of our teaching.


Table 12.2. Behavioral Changes in High-School Pupils Taught a Unit on
Communicable Diseases Compared with Those for a Control Group
(From Urban)

                                             Frequency
                             Experimental Group         Control Group
                             Pre-   End of  After     Pre-   End of  After
Behavior                     test   Unit    12 wks    test   Unit    12 wks
In Biology class:
  Put fingers in mouth        692    163     172       717    739     813
  Put objects in mouth        237     16     [?]       217    [?]     [?]
  Bit fingernails              44    [?]       6        61     90      18
  Put finger in nostril        60      8       2        27     51      48
  Rubbed eye with finger       90     25      30        67     96     129
  Used handkerchief to
    cough                      0%    82%     19%        0%     0%      3%
  Used handkerchief to
    sneeze                    15%    84%     86%        1%    12%      8%
In English class:
  Put fingers in mouth        907    283     327       941    888     860
  Put objects in mouth       2390    111     107       241    304     [?]
  Bit fingernails              63      7     [?]       [?]     91      88
  Put finger in nostril        41      7       5       [?]    [?]      22
  Rubbed eye with finger      155     60      48       196    181     184
  Used handkerchief to
    cough                      0%    78%     76%        0%     1%      0%
  Used handkerchief to
    sneeze                     0%    48%     59%        0%     0%      0%
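
The size of the changes in Table 12.2 is easier to compare across behaviors when each later count is expressed as a percentage change from the pretest. The sketch below does this for three biology-class rows whose counts are legible in the table; the function name `percent_change` and the dictionary layout are illustrative assumptions, not part of the original study.

```python
# Counts copied from Table 12.2 (biology class):
# (pretest, end of unit, after 12 weeks).
counts = {
    "Put fingers in mouth": {
        "experimental": (692, 163, 172),
        "control": (717, 739, 813),
    },
    "Put finger in nostril": {
        "experimental": (60, 8, 2),
        "control": (27, 51, 48),
    },
    "Rubbed eye with finger": {
        "experimental": (90, 25, 30),
        "control": (67, 96, 129),
    },
}


def percent_change(before, after):
    """Change from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before


for behavior, groups in counts.items():
    for group, (pre, end, later) in groups.items():
        print(
            f"{behavior:<24}{group:<14}"
            f"{percent_change(pre, end):+8.1f}%"
            f"{percent_change(pre, later):+8.1f}%"
        )
```

For putting fingers in the mouth, for instance, the experimental group's count falls by roughly three quarters by the end of the unit, while the control group's count rises slightly over the same interval.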

The three illustrations we have just sketched in have shown 
methods of direct observation used in rather different ways and ap- 
plied to quite different problems. These examples are representative 
of many specific studies from among which the selection was made. 
They suggest the range of usefulness of this way of studying the 
individual person or groups of persons. 


EVALUATION OF SYSTEMATIC OBSERVATION AS AN 
APPROACH TO PERSONALITY MEASUREMENT 


We have described the nature of systematic observation, outlined
some of the precautions necessary if the procedure is to be
satisfactorily reliable and objective, and illustrated the application
of the method to three quite different sorts of research studies. Now
let us undertake an appraisal of the method, indicating some of its
strengths and some of its limitations as a way of studying the
personality of children.

ADVANTAGES OF DIRECT OBSERVATION 


Procedures based on direct observation of the behavior of others
have a number of features that make them attractive as personality
evaluation devices. Some of the more significant points are
considered below.

A Record of Actual Behavior. When we observe an individual, we
get a record of what he actually does. We are not dealing with his
rationalizations and protestations. If our observational procedures
have been well planned and our observers carefully trained, our score
is in large measure free from the biases and idiosyncrasies of the
particular observer. Our record of the individual is not a reflection
of what he thinks he is, or of what someone else thinks he is. His
actions speak to us directly. If, as will be true in many cases, our
concern is in what the person does or the way in which his behavior
has been changed, then observation of his behavior is the most direct,
and in many ways the most satisfying, way of getting the relevant
information.



Applicable in a Natural Situation. One great advantage of obser- 
vational techniques is that they can be applied to the naturally oc- 
curring situations of life. Observation is not restricted to a test 
situation, though we saw when we were describing situational tests 
that observation is often an important adjunct to a test situation. 
Observation can be carried out in the nursery school, in the class- 
room, in the cafeteria, on the playground, at camp, in the squadron 
day room, or anywhere individuals work or play in a public setting. 
There are, as we shall point out presently, practical difficulties and 
limitations that arise in stage-managing the observations. But in
spite of these, direct observation is a widely applicable approach to
studying individual personalities operating in a normal non-test
setting.

Usable with Young Children and Others for Whom Verbal Communication
is Difficult. Observation is possible with small children, no
matter how young. As a matter of fact, the younger the child, the
easier it is to observe him. The infant is completely unselfconscious,
and we can sit and watch what he does with no special procedures or
precautions. With older children, it becomes necessary either to
screen the observer from the subjects being observed or adapt them
to him. The observer may be separated by a one-way vision screen
so that he can see the child or children but they cannot see him.
However, the requirement to provide such a physical setting seriously
restricts the situations in which observation may be done.
More often, and more simply, the observer may be present long
enough and function sufficiently unobtrusively so that the subjects
come to pay no attention to him, accepting him as a natural part of
the surroundings.

The value of direct observation is greatest where its application is
most feasible: with young children. Young children are quite limited
in their ability to communicate through language. They do not
have much experience or facility in analyzing or reporting their
feelings or the reasons for their actions. They are often shy and
resistant with strangers. For these groups especially, direct
observation provides an important avenue of approach.


LIMITATIONS OF DIRECT OBSERVATION


The factors we have just described contribute to the attractiveness
of direct observation as a technique for studying individuals.
However, it is by no means the answer to all our measurement problems.
A number of factors seriously limit the usefulness of observational
techniques. These range from very practical and down-to-earth
considerations, which we shall consider first, to more fundamental
theoretical issues.

Cost of Making the Observations. Observation is costly primarily 
in the demands that it makes on the time of trained observers. In 
the illustrations we gave, each child or class was observed for a mini- 
mum of 3 hours and for a maximum that extended well beyond this 
amount. When observations are to be made of a substantial num- 
ber of individuals or class groups, the hours rapidly mount up. 
Systematic direct observation and recording of behavior is for this 
reason alone limited in its use to research projects, in which the 
necessary time commitments can be made. It is not practical in 
routine school operations to find the manpower required to make 
direct observations routinely of each pupil. 

The cost of direct observation lies not merely in the observer time
required in making the recordings. Any form of special setting or
any form of mechanical recording represents an additional cost.
Furthermore, when the original record is a running diary account
of activities, a motion picture of activities, or a sound track of a
discussion or conversation, analysis of the records is also likely to be
time-consuming.

Fitting the Observer into the Setting. There is always a question of
whether having an observer in any setting, watching and making
notes of what goes on, will actually change what happens. In many
of the situations one wishes to observe, it is not practical to have
the observer invisible. One hopes, often with justification, that
after an initial period of getting used to the observer all persons
being observed will take him completely for granted and ignore him.
However, this is easier in some situations than others. When the
group is small, where it is necessary for the observer to follow its
activities very closely, or when the group meets for too short a time
to get used to being observed, the members may not be too successful
in coming to think of the observer as a piece of the furniture.

Eliminating Subjectivity and Bias. When observational procedures
are used, it is found necessary to use all possible precautions to keep
the observer's interpretations and biases out of the observation.
Our objective is to have the observer function purely as a recording
instrument that is sensitive to, and makes a record of, certain
categories of behavior. Most of the precautions we described on pp.
312-315 are directed toward that end. But at best we are only
partially successful. The observer is always human. We may minimize
his influence, but we cannot eliminate it. Especially when the
phenomena we are studying are complex or involve an element of
interpretation, we must beware of the role of the observer in the final
result.

Determining a Meaningful and Productive Set of Behavior Cate- 
gories to Observe. Any observation is selective. Only certain limited 
aspects of the individual’s behavior can be observed and recorded. 
Furthermore, if observations are to be treated quantitatively they 
must be classified, grouped, and counted. Any system of classifica- 
tion is a more or less arbitrary framework that we impose upon the 
infinitely varied events of life. It is not always easy to set up a 
framework that serves our purposes well. Thus, the reader may 
very well feel that the categories of behavior described on p. 317 
do not indicate important outcomes of a progressive school program 
or that the types of activities included under a given heading are 
inappropriate to that heading. Or we may have classified aggres- 
sive acts in terms of the overt behavior, hitting, pushing, or grabbing, 
whereas for our purposes it might have been better to classify by 
the precipitating event (if we could observe it): aggression in re- 
sponse to conflict over property, or as a reaction to verbal disparage- 
ment, or after thwarting of some activity in progress. In any event, 
scores based upon observations of behavior can be no more signifi- 
cant and meaningful than the categories we have devised for analyz- 
ing that behavior. 
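
The dependence of scores on the chosen categories can be illustrated with a small sketch. The five incidents below are invented for illustration; each is noted both by its overt form and by its precipitating event, and the same observations are then counted under each scheme.

```python
from collections import Counter

# Hypothetical observations: (overt behavior, precipitating event).
observations = [
    ("hitting", "conflict over property"),
    ("pushing", "conflict over property"),
    ("grabbing", "conflict over property"),
    ("hitting", "verbal disparagement"),
    ("pushing", "thwarting of activity"),
]


def tally(records, index):
    """Count observations under one classification scheme."""
    return Counter(record[index] for record in records)


by_overt_behavior = tally(observations, 0)
by_precipitating_event = tally(observations, 1)

print(by_overt_behavior)
print(by_precipitating_event)
```

Both tallies describe the same five acts, yet one suggests a child who mostly hits and pushes while the other suggests a child who mostly reacts to disputes over property. Which picture is more useful depends on the purpose of the study.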

Determining the Significance of an Isolated Item of Behavior. Be- 
cause of the need to achieve reliability and objectivity, the tendency 
has been to focus observation upon rather small and discrete acts, 
or at least to break the analysis of observational material up into 
small and discrete acts. There is a real danger that when this is done 
the meaning of the behavior, the true significance of the action, will 
be lost. Thus, we observe that 3-year-old A hugs 3-year-old B.
Is this an act of affection? Or is it, as seems frequently the case at
this age, an act of aggression? If the observation stands alone, we
have no way of telling. Or suppose that A hits B. This is fairly
clearly an aggressive act, but what does it signify in the life economy
of A? Is it a healthful and adjustive reaction to earlier domination
by B, a bursting of bonds that have shackled A? Or is it a displaced
aggression built up by domination at home by a parent or older
sibling? Or does it signify any one of a number of other things?

The External Character of Observation. What the illustration we
have just given brings out is that observation is external. The
"outsideness" is exaggerated when little bits of behavior are analyzed
out of context. But the "outsideness" is a fundamental feature of
any observational approach to studying behavior. We always face
the problem of determining the meaning of the behavior and must
recognize that what we have before us is only what the person does,
not what it signifies.


INFORMAL OBSERVATION—THE ANECDOTAL 
RECORD 


The systematic and continuous observations of pupil behavior 
that we have considered in the previous section are essentially re- 
search tools. They are too time-consuming to be practical and usu- 
ally too specialized to be useful to classroom teachers trying to 
build up a better understanding of their pupils. However, every 
teacher is observing his pupils from day to day, and there is no 
reason why those observations should not be informally recorded as 
a guide to his own increased understanding or to that of others who 
will later deal with the pupils. Such reports of informal teacher 
observations of pupils have been called anecdotal records. 

But why should the observations be recorded? Who should be 
observed? What should be recorded? How should the records be 
kept? What steps should be taken to organize and summarize them? 
What problems are commonly encountered in making and using
anecdotal records? These are some of the points we shall need to
consider.


WHY MAKE A RECORD? 


Of course teachers learn from observing pupils, but why record
the separate observations? Why not trust to the teacher's memory
to summarize in his own mind the observations he makes from day
to day and allow him to report his evaluation of a pupil in a term-end
descriptive statement or set of ratings?

The answer lies partly in the fallibility of human memory and
the inadequacy of human beings as assemblers and combiners of facts
about another person. We make many observations of other people,
but we make so many that they all melt together in our memory,
and only a rare few of the most striking experiences continue to
retain their individuality. Even these become warped and distorted
with the lapse of time. And the way the sharper memories are
distorted, together with the flavor of the stew that is made of our
blurred recollections of ordinary day-to-day experiences, depends as
much on the rememberer as on the event. Our general reaction to a
child, flavored by all our ingrained prejudices and warped by what we
have heard about him or by our initial experiences with him, provides
the framework into which our observations are fitted.

A further reason for not relying upon summary impressions of a 
child, as reported by the teacher at intervals, is that such reports 
have not generally proved too informative or useful. They are in 
general terms, often moralistic in tone, evaluating rather than
describing, and telling more about the teacher's reactions to the child
than about the child.

Wilhelmina does not work nearly as hard as she should. She seems to
be a bright child, and does well when she really tries. She can be very
annoying at times.


What do we now really know about Wilhelmina? What chance do 
we have of understanding her or of working with her more effec- 
tively? We have a fairly good picture of a teacher's dissatisfaction, 
but know very little of a factual nature about the child. 

Making a record of an observation of child behavior, a prompt
record while the behavior is still fresh in the mind, can be a
corrective for the limitations and distortions of memory. Such a record
can come, with practice, to provide a relatively direct and objective
report of actions, with the reactions of the observer kept down to a
minimum.

During art class Wilhelmina was very slow in starting work. She
stopped her own work several times during the period to wander around
the room and look at what other pupils were doing and tell them what
was wrong with their pictures. Mary and Jane each told her to mind
her own business and leave them alone.

This notation provides an item of factual information that can help
us to know and understand Wilhelmina. It is a specific excerpt of
behavior. Put together with a number of others, it may yield a
factual and meaningful picture of the child.


WHO SHOULD BE OBSERVED? 

Anecdotal records may serve two rather different sorts of purpose.
A first purpose may be to give teachers practice in studying children,
with a view to deepening their understandings and increasing their
sympathetic insights. If the records are serving as part of an
in-service educational program in child study, it may be well to
concentrate observations on two, or at most three or four, pupils. This
will permit a completeness of observation and a fullness of reporting
that would not otherwise be possible. The children will ordinarily
be selected for observation in terms of the teacher's special interest
in them. However, it would probably be unfortunate to focus
exclusively or even primarily upon "problem" cases. There is much
to be learned and much light to be cast on the child with special
problems by studying the "normal" child, with his or her normal
problems, quirks, and idiosyncrasies.

In schools in which anecdotal records have become a part of the 
basic cumulative record system, anecdotes should be reported for 
each child in the class group. In this case, it will naturally be neces- 
sary to be content with a much more limited sample of observations 
for any single child. 


WHAT SHOULD BE RECORDED? 


This question divides into two. Which incidents should be 
made a matter of record? What should be included in the record 
of each? 

Items Worthy of Recording. Anecdotal records provide an informal
and largely qualitative picture of certain aspects of an individual's
behavior. There is no point in using them for aspects of his behavior
that can be appraised by more objective and accurate methods.
Intellectual ability, academic achievement, and creative skills are
better shown by standardized tests on the one hand or by pupil
products on the other. It is primarily aspects of social functioning
or adjustment to personal problems that one hopes to illuminate by
records of incidents of school behavior. The interactions of a child
with the other children in the room, evidences of acceptance or
rejection, aggression or withdrawal, events that throw light on the
child's role in the group and his reaction to it are fit material for
our pen. Indications of personal tensions and adaptations to them,
habitual mood and temper, or special crises and adjustments are worth
recording. We may ask in each case: What can this incident tell a
reader who does not know the child (a guidance worker, a subsequent
teacher) that he could not find out in some other way?

Material to Be Included in an Anecdote. An anecdotal record should
be an accurate factual report of an event in a child's life, reported
with enough of the setting and enough detail so that it is a meaningful
item of behavior. Such a report is far from easy to prepare.
Experience with teachers who are starting to try to write anecdotes
about their pupils indicates that there are three common deviations
from the prescription we have given.


1. The anecdote evaluates, instead of reporting. It tells the
teacher's reaction to the child. "John was a very difficult child today"
is a report of how the teacher felt about John, not what he did.

2. The anecdote interprets, instead of reporting. It gives the teach- 
er's conclusions as to the reasons for behavior, instead of or as well 
as a report of what actually occurred. For example, we may see an 
item that reads: "Oscar simply cannot keep still in class now. He is 
growing so fast that he is restless all the time." The second sentence 
is pure interpretation, based upon extremely meager evidence, as 
far as we can tell. It tells us nothing about what happened. Expla- 
nations and interpretations are all very well in their place, if they are 
kept tentative and thought of only as hypotheses for further testing. 
But they should be clearly distinguished from description. The 
primary function of an anecdotal record is to describe a child's be- 
havior. 

3. The anecdote describes in general terms, rather than being specific.
A report of this type would be the following: "Mary is not well
accepted by the other children in the class. She usually stands on the
sidelines at recess and does not take part in the games." This
summarizing statement may be of some value in providing a picture of
the child. However, it lacks the objectivity and concreteness that
characterize the description of a single specific event. It incorporates
more of selection and evaluation than we would like in our basic
raw material.


In contrast with the three variations indicated above, a good
anecdotal record has the following features:

1. It provides an accurate description of a specific event.

2. It describes the setting sufficiently to give the event meaning.

3. If it includes interpretation or evaluation by the recorder, this
interpretation is separated from the description and its different
status is clearly identified.

4. The event it describes is one that relates to the child's personal
development or social interactions.

5. The event it describes is either representative of the typical
behavior of the child or significant because it is strikingly different
from his usual form of behavior. If it is unusual behavior for the
child, that fact is noted.

The following three anecdotes are presented as conforming fairly
well to the above specifications. Note that no attempt is made to
phrase the anecdotes in full sentences. The emphasis is on ease of
recording rather than on grammatical elegance. The reader may
find it worthwhile to check them off point by point and see how he
would like each changed to make it a more useful and meaningful
piece of data about a child.


Class: 5A Pupil: Henry K. Date: 3/15/53

Class working as a group, Richard serving as chairman, discussing
plans for class exhibit for local "Visit Your School Week." Henry's hand
up and trying to talk almost continuously. Interrupted other children
four or five times. Interruptions largely caustic or facetious comment.
When Richard told him he was "out of order" because someone else was
talking, he said, "Aw, nuts to you," and paid no more attention to the
discussion.

(Typical of Henry's behavior a number of times lately. Aggressively
seeking attention, then withdraws if rebuffed.)


Class: 8B Pupil: Peter Y. Date: 4/25/52

Peter drowsed off in social studies discussion period after lunch today.
Far-away look; then eyes closed. Came to with a start when spoken to.
Seemed attentive for few minutes, then dropped off again. Sleepy
throughout the period.

(Same sort of thing several times in past two weeks. Is something
preventing him from getting enough sleep? What is it?)


Class: 6B Pupil: Betsy R. Date: 10/6/52

Coming into class after morning recess Betsy slapped Sue, reason
unknown. While getting seated, had a row with Jane about ownership of a
pencil. Later in morning, pinched Ellen. Two or three other squabbles
before lunch. Standing by herself after lunch, not playing with other
girls.

(Very unusual for Betsy. Usually even tempered, well liked, and the
center of the group.)


HOW SHOULD ANECDOTAL RECORDS BE KEPT? 


The exact mechanical format of the anecdotal records is of secondary
importance compared with considerations of content. However, the
usefulness of records will depend upon the ease with which the records
for a particular pupil may be assembled, studied, and summarized. It is
also important that the sheer mechanical burdens of keeping the records
be kept to a minimum. One of the main practical problems in the use of
anecdotal records has been the clerical problems they impose.

The appropriate form for keeping records will depend upon the
primary purpose for which they are being kept. If the records are
serving to guide the teacher's study of two or three particular pupils,
they may well be kept in the form of two or three separate logs or
diaries. Successive entries should then be dated and entered in
sequence in a notebook, on sheets of typewriter paper, or on file cards.
When records are being made from time to time on all the pupils
in a school, as a part of the regular cumulative record system of the
school, a uniform method of recording that facilitates filing the
records of each child in his individual file folder will be needed.
The record form should be evaluated in terms of the total record
system. If an individual file folder is used, an 8½ x 11 sheet of
paper will often prove suitable. The record form should provide
space for identifying information (class, pupil, date, person making
the record), the anecdote itself, and possibly for an evaluating
comment.
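
For a school that keeps such records in a file rather than on paper, the record form described above might be sketched as follows. This is only an illustration: the class name `AnecdotalRecord`, its field names, and the recorder label are assumptions made for the example, and the sample entry is abridged from the anecdote about Peter Y. given earlier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AnecdotalRecord:
    # Identifying information named in the text.
    class_group: str
    pupil: str
    recorded_on: date
    recorder: str
    # Factual description of the incident.
    anecdote: str
    # Optional evaluating comment, kept apart from the description.
    comment: str = ""

    def render(self) -> str:
        """Lay the record out as a plain-text form."""
        lines = [
            f"Class: {self.class_group}  Pupil: {self.pupil}  "
            f"Date: {self.recorded_on:%m/%d/%y}  Recorder: {self.recorder}",
            "",
            self.anecdote,
        ]
        if self.comment:
            # Parentheses mark the comment as interpretation, not description.
            lines += ["", f"({self.comment})"]
        return "\n".join(lines)


record = AnecdotalRecord(
    class_group="8B",
    pupil="Peter Y.",
    recorded_on=date(1952, 4, 25),
    recorder="Teacher (hypothetical)",
    anecdote="Drowsed off in social studies discussion period after lunch.",
    comment="Same sort of thing several times in past two weeks.",
)
print(record.render())
```

Keeping the comment in its own field is one simple way of holding the description and the recorder's interpretation apart.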


WHAT SHOULD BE DONE TO ORGANIZE OR SUMMARIZE RECORDS? 

Each original anecdotal record is an item of information about an 
individual. A series of records provides a whole set of such items. 
But for data to be useful they must be organized, summarized, and 
interpreted. The data in such an intelligence test as the Binet
consist of a series of responses to specific items that are summarized in
a mental age or I.Q. Although the significant elements in a set of
anecdotal records cannot be summarized as simply, some attempt at
bringing the items together into an organized picture of the
individual will usually be desirable.

At intervals, perhaps once a semester or possibly oftener if a child 
is being studied intensively, the anecdotes on an individual should 
be reviewed carefully. Recurring patterns should be noted. Any 
progressive changes should be brought out. A thumb-nail sketch of
the individual, as shown by the anecdotes, should be prepared. The
attempt should be made to relate the anecdotal material to other
facts that are known about the child: his health, intellectual ability,
academic achievement, home surroundings, and family pattern. A
tentative interpretation of the patterns may be attempted, if it is
recognized that any interpretation is to be thought of as a set of
very tentative hypotheses. In the summary, as in the records
themselves, the descriptive summary and the interpretation of it should
be kept clearly differentiated.
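
A periodic review of this kind can be sketched in a few lines. The entries, the tuple layout, and the function name `summarize` are all invented for illustration (the April 11 entry in particular is made up). The sketch shows one way of bringing a pupil's anecdotes together in date order while holding descriptions and tentative interpretations in separate lists.

```python
from collections import defaultdict
from datetime import date

# Hypothetical entries: (pupil, date, description, tentative interpretation).
entries = [
    ("Peter Y.", date(1952, 4, 25), "Drowsed off after lunch.", "Not enough sleep?"),
    ("Peter Y.", date(1952, 4, 11), "Asleep during discussion.", ""),
    ("Betsy R.", date(1952, 10, 6), "Slapped Sue, reason unknown.", "Very unusual for her."),
]


def summarize(records):
    """Group entries by pupil in date order, keeping description and
    interpretation in separate lists."""
    summary = defaultdict(lambda: {"descriptions": [], "hypotheses": []})
    for pupil, when, description, interpretation in sorted(records, key=lambda r: r[1]):
        summary[pupil]["descriptions"].append(f"{when.isoformat()}: {description}")
        if interpretation:
            summary[pupil]["hypotheses"].append(interpretation)
    return dict(summary)


review = summarize(entries)
for pupil, parts in review.items():
    print(pupil, parts)
```

The grouping only assembles the material. Noting recurring patterns and relating them to other facts about the child remains a matter of judgment.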


WHAT PROBLEMS ARISE IN MAKING AND INTERPRETING ANECDOTAL
RECORDS?

We have already indicated a number of the problems in making 
and using anecdotal records in the previous sections. These and 
some other issues will be summarized in this section. 




Problems Arising out of the Selection of Items. The number of
anecdotes that could be written about any individual is almost
limitless. The written record must consist of a relatively small fraction
of these, chosen by the observer as being significant or as typical of
the child. The quality of the accumulated data depends upon the
shrewdness and impartiality of this selection. Both the significance
and the truthfulness of the picture will depend upon the ability of
the observer to select items to record that are illuminating and truly
representative. Bias by the observer can easily creep into both the
selection and the recording of the items. For a child whom the
teacher dislikes, it is easy to pick out and record only situations in
which he appears in a bad light. If the teacher is unduly preoccupied
with academic achievement or an orderly classroom, incidents relating
to non-achievement or disorder may take a dominant role in the record.
It is hard to know how much bias is introduced in a set of anecdotes
by selectivity of this sort, but the problem is certainly a very real
one.

Problems Relating to the Phrasing of the Anecdote. Difficulties here
center around the tendencies, which we have already considered, to
include evaluation, interpretation, and generalities and to leave out
the specific factual description. Problems of literary style are also
occasionally a matter of concern. In this regard, the thing to
remember is that anecdotes are valued not as literary gems but for the
information they convey. Brevity and clarity, not literary elegance,
are the objectives even to the point of writing in phrases rather
than sentences.

Problems Relating to the Clerical Burden. One of the most serious
practical problems in any school program of anecdotal records is the
sheer clerical burden of preparing, filing, and summarizing the records.
With this problem in mind, any school system should move into
anecdotal recording cautiously. Recording should be tried first for
a few pupils in each class and gradually expanded. Recording
procedures should be kept as simple as possible. Literary style and
elegance of format should be minimized.

Problems Relating to Use. Like any other evaluation procedure,
anecdotal records are useful only if they are used. One must take
care that the records do not become an end in themselves. The
records must be accessible. They must be summarized periodically,
so that the user can refer to a concise summary.

One specific problem is a feeling on the part of some teachers
that they do not wish to be biased by what a previous teacher has
said about a child. When what the previous teacher has said is
primarily an expression of his reactions to the child, with a strong
admixture of personal prejudice, this unwillingness is understandable.
When the anecdotes and the summary become factual and descriptive,
there is no longer any reason to object to having the information.
It is as important for the teacher to start the year with information
about a child's personal and social development as it is to be
informed about his reading and number skills.


SUMMARY EVALUATION OF ANECDOTAL RECORDS 

An anecdotal record provides a medium for recording the observa- 
tion of a significant item of pupil behavior. When teachers have de- 
veloped skill in selecting incidents and in describing them objec- 
tively, when the mechanics of record-keeping and summarizing are 
kept within reasonable bounds, and when the records are available 
for use by those whose concern it is to understand the individual 
pupil, such records can be a significant aid to working with children. 


SUMMARY STATEMENT 


In personality assessment we are concerned with determining how 
the individual typically reacts to the life situations he encounters. 
Evidence on this may be obtained from (1) his actual behavior, (2)
the reactions of others to him, (3) what he says about himself, or
(4) his reactions to tasks that give relatively free rein to his
imagination and fantasy.

We may attempt to elicit typical behavior by actual test situations,
such as those represented by the honesty tests of May and
Hartshorne. These have the advantage that they can be scored as
directly and objectively as an ability test. However, the tests are
complex to develop and stage, have rather modest reliability, yield
results which seem to be rather specific to the particular test
situation, and are not readily adaptable to many of the aspects of
personality in which we are interested.

The situational test represents a compromise between a standard
test and an observational procedure. A lifelike test situation is
developed, into which the examinee is placed. Typically, it is a social
situation involving some type of interaction with other individuals
and structured to emphasize the facets of the personality in which
the investigator is particularly interested. For evaluation of the
examinee's behavior, however, reliance is placed on observation and
ratings. This permits a good deal more freedom in developing
test situations, and many sorts of interpersonal behavior may be
observed. In large measure, however, the actual value of the
observations that can be made in such settings remains to be
demonstrated.

Behavior in naturally occurring situations has been studied by
techniques of direct observation. Steps that have been taken to
refine the everyday observations we make of people include (1)
limitation of observations to a single aspect of behavior, (2) careful
definition of the behaviors falling within this category, (3) training of
observers, (4) quantification of observations, as by a procedure of
taking many short samples, and (5) development of procedures for
coding and recording the observations.

Direct observation has the advantages of (1) representing actual 
behavior, (2) being applicable to natural life situations, and (3) be- 
ing usable with young children and others with whom verbal com- 
munication is difficult. However, observational procedures present 
a number of problems, including (1) cost, (2) difficulty of fitting the 
observer into the situation, (3) difficulty of eliminating observer bias, 
(4) difficulty of setting up meaningful and productive categories to
observe, (5) difficulty in determining the meaning of isolated bits of
behavior, and (6) the fact that an observer inevitably has an outside
view of the person whom he observes.

Systematically scheduled observation is rarely practical for
teachers, job supervisors, or other persons for whom personality
appraisal is secondary to other aspects of their job. Such people may use
informal anecdotal records to accumulate factual information about
a pupil or employee. Informal observations should be factual
reports of significant items of behavior; they should avoid evaluation,
interpretation, and broad generalizations. Records of observations
should be kept as simply as possible and reviewed and interpreted
periodically to give an organized picture of the person who has been
observed.


REFERENCES 
1. Assessment Staff, U. S. Office of Strategic Services, Assessment of men,
New York, Rinehart, 1948.

2. Hartshorne, H., and M. A. May, Studies in deceit, New York, Macmillan,
1928.

3. Jersild, A. T., and Mary D. Fite, The influence of nursery school
attendance on children's social adjustments, Child Develpm. Monogr., No. 25, 1939.

4. Jersild, A. T., R. L. Thorndike, B. Goldman, and J. J. Loftus, An
evaluation of aspects of the activity program in the New York City public
elementary schools, J. exp. Educ., 1939, 8, 166-207.

5. Kelly, E. L., and D. W. Fiske, The prediction of performance in clinical
psychology, Ann Arbor, University of Michigan Press, 1951.

6. Olson, W. C., The measurement of nervous habits in normal children,
Univ. of Minn. Inst. of Child Welfare Monogr., 1929, No. 3.

7. Urban, J., Behavior changes resulting from a study of communicable
diseases, Teachers College Contrib. Educ., 1943, No. 896.


SUGGESTED ADDITIONAL READING 


Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chapter

Arrington, Ruth E., Time-sampling studies of child behavior, Psychol.
Monogr., 1939, 51, No. 2.

Biber, Barbara, et al., Child life in school, New York, Dutton, 1942, pp.
33-53.

Ferguson, Leonard W., Personality measurement, New York, McGraw-Hill,
1952, Chapter 14.

Monroe, Walter S., Editor, Encyclopedia of educational research, New York,
Macmillan, 1950, pp. 806-817.

Staff, Division on Child Development, Helping teachers understand children,
Washington, D. C., American Council on Education, 1945.

Traxler, A. E., The nature and use of anecdotal records, rev., New York,
Educational Records Bureau, 1949.


QUESTIONS FOR DISCUSSION 


1. In their studies of honesty, Hartshorne and May report quite low corre- 
lations between different behavior tests of honesty. If this is true for other 
qualities as well, what does it mean for our understanding of people? 

2. What implications do the findings of May and Hartshorne have for the
classroom teacher when it comes to writing descriptions or evaluations of
students for permanent school records?

3. Try to plan a number of behavior tests for some trait other than
honesty.

4. Plan a situational test for use in a school or industrial situation. What
would you hope to get from this test that you could not get in other ways?
What would be the difficulties of using such a test as you have proposed?

5. How could the class discussion that takes place in most classes serve as
the basis for systematic observation? Make a plan for recording these
observations.

6. In a research study you propose to use systematic observations of school
children as a method of studying their social adjustment. What problems
would you encounter? What precautions would you need to take in
interpreting the results?

7. What advantages do systematic observations or short sample
observations have over the observations of everyday life? What limitations
do these more specialized procedures have?

8. If you are working in a classroom, make anecdotal records on some one
child over a 1-week period. Observe as well as you can the guides for making
anecdotal records given on pp. 326-327. What difficulties did you encounter
in making the records?




9. Criticize the following anecdotal records: 


a. "Mary continues to be a nuisance in class. She is noisy and not only 
fails to do her own work but keeps other children from doing theirs. I 
don't know what I am going to do about it." 

b. “John had a good deal of trouble with his arithmetic today. He didn't 
seem to be able to get the idea of reducing fractions to a common denomi- 
nator. Out of several problems he was able to identify the lowest common 
denominator only once." 


Chapter 13 


The Individual as Others See Him


In the last chapter we considered various ways in which the
individual's typical behavior in various sorts of situations might be
observed, recorded, and analyzed. A second main way in which an
individual's personality shows itself is through the impression he
makes upon others. We are interested in the second person now not
as an impersonal recording instrument but as a reagent reacting to
the first personality. How well does A like B? Does A consider B
a pleasing person to have around? An effective worker? A good
job risk? Does A consider B to be conscientious? Trustworthy?
Emotionally stable? Questions of this sort are continually being
asked of every teacher, supervisor, former employer, minister, or
even friend. We must now inquire how fruitful it is to raise such
questions and what precautions must be observed if the questions are
to receive useful answers.

We shall first give brief consideration to the unstructured letter of
recommendation. Then we shall examine rating scales and rating
procedures. Finally, we shall consider some special forms of rating:
nominating techniques and forced-choice rating procedures.


LETTERS OF RECOMMENDATION 


The most fluid form for getting an impression of one person through
the eyes of a second person is to invite the second person to talk or
write to you about him. Such a communication could be obtained
in any setting. However, the setting in which it most commonly
does occur is when person A is a candidate for something: admission
to a school, a scholarship or fellowship, a job, membership in a club,
or a security clearance. He then furnishes the institution, placement
agency, or employer the names of people who know him well or
know him in a particular capacity, and that agency obtains
statements about A from B and C, who know him.


How useful and how informative is the material that is included 
in free, unstructured communications describing another person? 
Actually, in spite of the vast numbers of recommendations written 
every year, very little of a solid and factual nature is known about 
their adequacy or the effectiveness with which they discharge their 
function. Opinion covers the full gamut from a belief that a free 
and unconstrained letter about an applicant is the best possible way 
to get an evaluation of him to the conviction that letters of
recommendation are completely worthless, from a conviction that the
letter of recommendation is the core of any selection program to a
feeling that the best thing to do with recommendations is to burn 
them. But factual studies of the reliability and validity of the in- 
formation that is gotten from a letter of recommendation or of the 
extent to which recommendations influence the action taken with re- 
spect to an applicant are fragmentary in the extreme. 
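
Reliability of this kind is usually expressed as a correlation between two independent letters written about the same applicants. The sketch below shows the arithmetic with a hand-written Pearson formula. The six pairs of enthusiasm ratings are invented for illustration and are not data from any study.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical enthusiasm ratings (1 = lukewarm, 5 = glowing);
# position i in each list refers to the same applicant.
first_letter = [3, 4, 2, 5, 4, 3]
second_letter = [4, 3, 3, 4, 5, 2]

r = pearson_r(first_letter, second_letter)
print(round(r, 2))
```

With these made-up ratings the two letters agree only moderately. A value near 1 would mean that any one letter tells nearly the same story as any other, and a value near 0 that a single letter says little about what a second writer would report.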

The letter of recommendation is such an unstructured document
that it is very hard to study by sound research techniques. However,
several investigators have attempted to make analyses of the
content of the letters and to scale them with respect to the
enthusiasm of the endorsement they provided. A moderate degree of
agreement has been found * between different letters written about
the same person. Within a group of applicants for jobs in
secondary-school teaching from one teacher-training institution the
between-letters reliability would be represented by a correlation of about
0.40. There was some evidence in this same study that the letters of
those who got the jobs were a little higher on the enthusiasm scale than
letters of applicants who were not employed. However, another
study failed to find any difference between the terms used to
describe job getters and other applicants.

The extent to which a letter of recommendation provides a valid
appraisal of an individual and the extent to which it is accurately
diagnostic of outstanding points, strengths or weaknesses, is almost
completely unknown. However, we cannot be very sanguine. Most
of the limitations that we shall presently discuss in connection with
more structured rating scales apply with at least equal force to
uncontrolled letters. In addition, each respondent is free to go off in
whatever direction his fancy dictates, so that there is no core of
content common to the different letters about a single person or to
the letters dealing with different persons. One letter may deal with
A's social charm; a second, with B's integrity; and a third, with C's
originality. On what common base are we to compare the three?
Add to this the facts that (1) the applicant usually is more or less
free to select the persons who will write about him and may be ex- 
pected to pick those who will support him and that (2) recommenders 
differ profoundly in their propensity for using superlatives, and the 
prospect is not a very гозу опе. 

Further research studies of the validity of free descriptions of one 
person by his fellows are urgently needed. In the meantime, recom- 
mendations will continue to be written—and perhaps to be used. 
We must turn our attention to more structured evaluation procedures.


RATING SCALES 


Undoubtedly it was in part the extreme subjectivity of the un- 
structured statement, the lack of a common core of content or stand- 
ard of reference from person to person, and the extraordinary diffi- 
culty of quantifying the materials that gave impetus to the develop- 
ment of rating scales. Rating procedures attempt to overcome just 
these deficiencies. They attempt to get appraisals on a common set 
of attributes for all raters and ratees and to have these expressed on 
a common quantitative scale.

We all have had experience with ratings, either in making them
or in having them made about us or, more probably, in both
capacities. Rating scales appear in a large proportion of school report
cards, more clearly in the non-academic part. Thus, we often find
a section phrased somewhat as follows:

                 1st       2nd       3rd       4th
               Period    Period    Period    Period
Effort          ____      ____      ____      ____
Conduct         ____      ____      ____      ____
Citizenship     ____      ____      ____      ____
Cooperation     ____      ____      ____      ____
Adjustment      ____      ____      ____      ____

H = superior    S = satisfactory    U = unsatisfactory

Many civil service agencies and industrial firms send rating forms
out to persons listed as references by job applicants, asking for
evaluations of the individual's "initiative," "originality," "enthusiasm,"
or "ability to get along with people." These same companies or
agencies often require supervisors to give merit ratings of their
employees, rating them as "superior," "excellent," "very good," "good,"
"satisfactory" or "unsatisfactory" on a variety of traits or in
over-all usefulness. Colleges, medical schools, fellowship programs, and
still other agencies call for ratings as a part of their selection
procedure. Beyond these practical operating uses, ratings have been
involved in numbers of research projects. All in all, vast numbers 
of ratings are called for and given, often reluctantly, in our country 
week by week and month by month. Rating other people is a large- 
scale operation. 

The most common pattern of rating procedure presents the rater 
with a set of trait names, perhaps somewhat further defined, and a 
range of numbers, adjectives, or descriptions that are to represent 
levels or degrees of possession of the traits. He is called upon to 
rate one or more persons on the trait or traits by assigning him or 
them the number, letter, adjective, or description that is judged to 
fit best. Many variations have been rung in on this basic theme, 
and we shall consider them presently. Right now, however, let us 
consider some of the problems that arise when we try to get a group 
of judges to make these appraisals. 


PROBLEMS IN OBTAINING SOUND RATINGS 


The problems in obtaining valid appraisals of an individual through 
ratings are of two main sorts. There are first the factors that limit 
the rater's willingness to rate honestly and conscientiously, in
accordance with the instructions given to him. There are secondly the
factors that limit his ability to rate consistently and correctly, even
with the best of intentions. We shall need to consider each of these
in turn.


FACTORS AFFECTING THE RATER'S WILLINGNESS TO RATE CONSCIENTIOUSLY 


It is commonly assumed that each rater is trying his best to follow
the instructions that have been given him, and that his failures
are due entirely to human fallibility and ineptitude. However, this
is not necessarily true. There are at least two sets of circumstances
that may impair the integrity of a set of ratings: (1) The rater may
be unwilling to take the trouble that is called for by the appraisal
procedure; and (2) the rater may identify with the person rated to
such an extent that he is unwilling to make a rating that will hurt
him. Each of these merits some elaboration.

Unwillingness to Take the Necessary Pains. At best, ratings are a
bother. Careful and thoughtful ratings are even more of a bother.
In some rating procedures the attempt is made to get away from
subjective impressions and superficial reaction by introducing
elaborate procedures and precautions into the rating enterprise. Thus,
in one attempt to improve efficiency rating procedures for Air Force
officers, an elaborate form was introduced that was to serve as a
combined observational record and rating form. Fifty-four specific 
critical behaviors were described relating to officer efficiency. Scales 
were prepared describing degrees of excellence in each type of be- 
havior. The accompanying instructions called upon raters to ob- 
serve their ratees for a period before the official ratings were to be 
given and to tally on the rating form instances that had been ob- 
served of desirable and undesirable acts within each of the behavior 
categories described on the scale. After a year or two of use this 
form was discarded, in part at least because of its complexity and 
because raters were not willing to devote the time and thought that 
would have been required to maintain the preliminary observational 
records on which the ratings were to be based. 

In a lesser degree, one suspects that perfunctoriness in carrying
out the operation of rating is a factor contributing to lowered
effectiveness in many rating programs. Particularly if the number of
pupils or employees to be rated is large, the task of preparing periodic
ratings can become a decidedly onerous one. Unless raters are really 
"sold" on the importance of the ratings, the judgments are likely to 
be hurried and superficial ones, given more with an eye on finishing 
the task than with a concern for making accurate and analytical 
judgments. 

Identification with the Persons Being Rated. Ratings are often 
called for by some rather remote and impersonal agency. The Civil 
Service Commission, the Military Personnel Division of a remote 
Headquarters, the personnel director of a large company, or the 
central administrative staff of a school system are all pretty far away 
from the first line supervisor, the squadron commander, or the
classroom teacher. The rater is often closer to the persons being rated,
the workers in his office, the junior officers in his outfit, the pupils
in his class, than to the agency that requires the ratings to be made.
One of the first principles of supervision or leadership is that the
good leader looks out for the needs and welfare of his followers or
subordinates. Morale in an organization depends upon the conviction
that the leader of the organization will take care of the members
of the group. When ratings come along, "taking care of" becomes
a matter of seeing to it that one's own men fare as well as, or a
little better than, those in competing groups.

All this boils down to the fact that in some situations the rater is 
more interested in providing a "break" for the people whom he is 
rating and in seeing that they get at least as good treatment as
other groups than he is in providing accurate information for the
using agency. This situation is aggravated in many governmental
and official agencies by a policy of having the ratings public and
requiring that the rater discuss with the person being rated any
unfavorable material in the ratings. A further aggravation is produced
by setting up administrative rulings in which a minimum rating is
specified as required for promotion or pay increase. No wonder,
then, that ratings tend to climb or to pile up at a single scale point.
Thus, in certain governmental agencies during World War II the
typical rating, accounting for a very large proportion of the ratings
given, was "excellent." "Very good" became an expression of
marked dissatisfaction, while a rating of "satisfactory" was reserved
for someone you would get rid of at the first opportunity.

It is important to realize that a rater cannot always be depended 
upon to work wholeheartedly at giving valid ratings for the benefit 
of the using agency, that making ratings is usually a nuisance to 
him, and that he is often more committed to his own subordinates 
than to an outside agency. A rating program must be continuously 
"sold" and policed if it is to remain effective. And there are limits 
to the extent to which even an active campaign can overcome a 
rater's natural inertia and interest in his own little group. 


FACTORS AFFECTING THE RATER'S ABILITY TO RATE ACCURATELY 


Even when a group of raters are presumably well motivated and 
doing their best to provide valid judgments, there are still a number 
of factors that operate to limit the validity of those judgments. 
These center around the ambiguity of the quality to be observed,
the covertness of the attribute, lack of opportunity to observe, lack
of a uniform standard of reference, and specific rater biases and
idiosyncrasies.

Ambiguity of Meaning of Dimension to Be Rated. Many rating
forms call for ratings of quite broad and abstract traits. Thus, in
our illustration on p. 337 we included, among others, "citizenship"
and "adjustment." These are neither more nor less vague and
general than the attributes included in other rating schedules. But
what do we mean by "citizenship" in an elementary-school pupil?
By what actions is "good citizenship" shown? Does it mean not
marking up the walls? Or not spitting on the floor? Or not pulling
little girls' hair? Or bringing newspaper clippings to class? Or
joining the Junior Red Cross? Or staying after school to help the
teacher clean up the room? What does it mean? Probably no two
raters would have just exactly the same things in mind when they
rated a group of pupils on "citizenship."




Or consider "initiative," "personality," "supervisory ability," 
"mental flexibility," "executive influence," or "adaptability." These 
are all examples from rating scales in actual use. Though there is 
certainly a core of uniformity in the meaning that these terms will 
have for different raters, there is with equal certainty a good deal of 
variability in meaning from one rater to another. In proportion as 
a term becomes abstract, its meaning becomes variable from person 
to person, and such qualities as those listed above are conspicuously 
abstract. 

The rating that a given child will receive for "citizenship" will, 
then, depend upon what "citizenship" means to the rater. If it 
means to rater A conforming to school regulations, he will rate
certain children high. If to rater B it means taking an active role in
school projects, the high ratings may go to quite different children. 
A first problem in getting consistent ratings is to achieve consistency 
in the meanings of the qualities being rated. 

Covertness of Trait Being Rated. If a trait is to be appraised by
an outsider, someone other than the person being rated, it must show
on the outside. It must be something that has its impact on the
outside world. Such characteristics as appearing at ease at social
gatherings, having a pleasant speaking voice, and participating
actively in group projects are characteristics that are essentially social
in character. They appear in interaction with other persons and are
directly observable. They are overt aspects of the person being
appraised. By contrast, attributes such as "feeling of insecurity,"
"self-sufficiency," "tension," or "loneliness" are inner personal
qualities. They are private aspects of personality and can only be crudely
inferred from what the person does. They are covert aspects of the 
individual. 

An attribute which is largely covert can be judged by the outsider 
only with great difficulty. Little of inner conflict or tension shows 
on the surface, and where it does show it is often in masquerade.
Thus, deep insecurity may express itself as aggression
against other pupils in one child, or as withdrawal into an inner
world in another. The insecurity is not a simple dimension of overt
behavior. It is an underlying dynamic factor that may break out
in different ways in different persons or even in the same person at
different times. Only a thorough knowledge of the individual,
combined with a good deal of psychological insight, makes it possible to
infer from the overt behavior the nature of his underlying covert
dynamics.


One can see, then, that rating procedures will be relatively un- 
satisfactory for the inner, covert aspects of the individual. Qualities 
that depend upon very thorough understanding of a person plus 
wise inferences from his behavior will be rated with low reliability 
and little validity. Ratings have most chance of being accurate for 
those qualities that show outwardly as a person interacts with other 
people. 

Opportunity to Observe the Person Rated. One factor that must 
always be borne in mind as a consideration limiting the accuracy of 
rating procedures is limited opportunity on the part of the rater to 
observe the person being rated. Thus, the high-school teacher
teaching four or five different class groups of 30 pupils each and seeing
many pupils only in a class setting may be called upon to make
judgments as to the "initiative" or "flexibility" of these pupils. The
college instructor who has taught a class of 100 pupils will receive
rating forms from an employment agency or from the college
administration asking for similar judgments. The truth of the matter
is that effective contact with the person to be rated has probably
been too limited to provide any adequate basis for the judgment
that is being requested. True, the ratee has been physically in the
presence of the rater for a good many hours, maybe several hundred,
but these have been very busy hours, concerned primarily with other
things than observing and forming judgments about pupil A. Pupil
A has had to compete with pupils B, C, D, and on to Z and also
with the primary concern with teaching rather than judging.

In a civil service or industrial setting much the same thing is true.
The primary concern is with getting the job done, and although in
theory the supervisor has had a good deal of time to observe each
worker, in practice he has been busy with other things. We may be
able to "sell" supervisors on the idea of devoting more of their
energy to observing and evaluating the persons working for them,
but there are very real limits to the amount of effort that can be
withdrawn from a supervisor's other functions to be applied to this
one.


We face not only the issue of general opportunity to observe, but
also that of specific opportunity to observe a particular aspect of
the individual's personality. This is related to the degree of
overtness of the trait, as discussed in the previous section. But it relates
also to the circumstances under which the rater has seen the ratee
functioning. Thus, we might question whether a teacher in a
thoroughly conventional classroom has seen a child under circumstances
under which he might be expected to show "initiative" or
"originality." The college instructor who has taught largely through
lectures is hardly well situated to rate a student's "presence" or "ability
to work with individuals." The supervisor of a clerk doing routine 
work is poorly situated to appraise "judgment." Whenever ratings 
are proposed, either for research purposes or as a basis for adminis- 
trative actions, we should ask with respect to each trait being rated: 
Has the rater had a chance to observe these people in enough of the 
sorts of situations in which they could be expected to show variations
in this trait so that his ratings can be expected to be meaningful? 
If the answer is "No," we would be well advised to abandon the 
ratings. 

Uniform Standard of Reference. A great many rating schedules
call for judgments of the persons being rated in some set of
categories such as

Outstanding, above average, average, below average, unsatisfactory.

Superior, good, fair, poor.

Best, good, average, fair, poor.

Outstanding, superior, better than satisfactory, satisfactory,
unsatisfactory.

Superior, excellent, very good, good, satisfactory, unsatisfactory.


But how good is "good"? Is a person who is "good" in "judgment"
in the top tenth of the group with whom he is being compared?
The top quarter? The top half? Or is he just not one of the bottom
tenth? And what is the group with whom he is supposed to be
compared? Is it all men of his age? All employees of the company? All
men in his particular job? All men in his job with his length of
experience? If the last, how is the rater supposed to know the level of
judgment that is typical for men in a particular job with a
particular level of experience?

The problem that all these questions are pointing up is that of
forming a standard against which to appraise a given ratee.
Variations in interpretation of terms and labels, variations in definition of
the reference population, and variations in experience with the
members of that background population all contribute to variability from
rater to rater in their standards of rating. The phenomenon is a
familiar one in academic grading practices. Practically every school
that has studied the problem has found enormous variations among
faculty members in the per cent of A's, B's, and C's that they give.
The same situation holds for any set of categories, numbers, letters,
or adjectives, that may be used. Standards of interpretation are
highly subjective and vary widely from one rater to another. One 
man's "outstanding" is another man's "satisfactory." 

Specific Rater Idiosyncrasies. Raters differ not only in general
"toughness" or "softness"; they also differ in a host of specific
idiosyncrasies. The experiences of life have built up in each of us
an assortment of likes and dislikes and an assortment of individual- 
ized interpretations of the characteristics of people. You may dis- 
trust anyone who does not look at you while he is talking to you. 
Your neighbor may consider any man a sissy who has a voice pitched 
higher than usual. Your boss may consider that a firm handshake 
is the guarantee of a strong character. Your golf partner may be 
convinced that blonds are flighty. These are rather definite reac- 
tions that may be explicit and clearly verbalized by the person in 
question. There are myriad other more vague and less tangible 
biases that we carry with us and that influence our ratings. These 
biases help to form our impression of a person and color all aspects 
of our reaction to him. They enter into our ratings too. In some 
cases, our rating of one or two traits may be affected. But often
the bias is one of general liking for or aversion to the person, and
this generalized reaction colors all our specific ratings. Thus, the
ratings reflect not only the general subjective rating standard of the
rater, but also his specific biases with respect to the person being
rated.

THE OUTCOME OF FACTORS LIMITING RATING EFFECTIVENESS 

What is the net result of these factors affecting the raters'
willingness to rate conscientiously and ability to rate accurately? The
effects show up in certain pervasive distortions of the ratings, in
relatively low reliabilities, and in doubt as to the basic validity of
rating procedures.

The Generosity Error. We have seen that the rater is often as
much committed to the people he is rating as he is to the agency
for which ratings are being prepared. Over and above this, there
seems to be a widespread unwillingness on the part of raters to damn
a fellow man with a low rating. The net result is that ratings tend
almost universally to pile up at the high end of any scale. The
unspoken philosophy of the rater seems to be "one man is as good as
another, if not a little better," so that "average" becomes not the
mid-point of a set of ratings but near the lower end of the group.
One finds quite generally the paradox of a great majority of the
group being rated above average.


The generosity error, if it operated uniformly for all raters, would 
not be particularly disturbing. We would merely have to remember 
that ratings cannot be interpreted in terms of their verbal labels 
and that “average” means "low" and "very good" means "average." 
Makers of rating scales have countered this humane tendency by 
having several steps on their scale on the plus side of average, so 
that there is room for differentiation without having to get
disagreeable and call a person "average."

It is differences between raters in the degree of their "generosity 
error" that are more troublesome. To correct for such differences 
is a good deal more of a problem. We shall consider presently some 
special techniques that have been developed for that purpose.

The Halo Error. Limitations in our experience with the person
being rated, lack of opportunity to observe the specific qualities 
that are called for in the rating instrument, and the influence of 
personal biases that affect our general liking for the person all con- 
spire to produce another type of error in our ratings. This is a ten- 
dency to rate in terms of over-all general impression without differ- 
entiating specific aspects, of allowing our total reaction to the person 
to color our judgment of each specific trait. This is called "halo." 

We can illustrate halo by a set of data on embryo airplane
commanders in World War II. Students were rated by their instructors
for such qualities as "eagerness," "foresight," "leadership,"
"instrument flying," "formation flying," "lead crew potentiality," and
"over-all value." The correlation between two raters for the same
attribute was, on the average, about 0.60. This serves as a measure
of the reliability of the ratings. We may speak of it as the between-
raters reliability. The average correlation between different attri-
butes for the same rater was about 0.75. That is, the correlation
between ratings of different qualities was higher than the reliability
of the separate ratings. This consistency can only be accounted for
by a general halo that made instructor A's appraisal of student B
much the same no matter what attribute was being rated.
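
The comparison just described can be sketched in code. What follows is
only an illustration on synthetic ratings, not the wartime data: the
function names, the numbers of students and traits, and the sizes of
the "true trait," "halo," and "noise" components are all assumptions
chosen for the example. It computes the average between-raters
correlation for the same attribute and the average correlation between
different attributes for the same rater.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def column(ratings, j):
    """Ratings of every student on trait j."""
    return [row[j] for row in ratings]


def halo_summary(rater_a, rater_b):
    """Return (between-raters reliability, within-rater trait intercorrelation).

    rater_a and rater_b are lists with one row per student; each row
    holds that rater's ratings of the student on every trait.
    """
    n_traits = len(rater_a[0])
    # Same attribute, two different raters.
    between = [pearson(column(rater_a, j), column(rater_b, j))
               for j in range(n_traits)]
    # Different attributes, same rater.
    within = [pearson(column(r, j), column(r, k))
              for r in (rater_a, rater_b)
              for j, k in combinations(range(n_traits), 2)]
    return mean(between), mean(within)


def simulate(n_students=300, n_traits=4, seed=0):
    """Hypothetical ratings: true trait + rater's general impression + noise."""
    rng = random.Random(seed)
    true = [[rng.gauss(0, 1.0) for _ in range(n_traits)]
            for _ in range(n_students)]
    raters = []
    for _ in range(2):
        rows = []
        for s in range(n_students):
            halo = rng.gauss(0, 1.5)  # this rater's over-all liking for student s
            rows.append([true[s][j] + halo + rng.gauss(0, 0.5)
                         for j in range(n_traits)])
        raters.append(rows)
    return raters


rater_a, rater_b = simulate()
between, within = halo_summary(rater_a, rater_b)
print(f"between raters, same attribute:    {between:.2f}")
print(f"same rater, different attributes:  {within:.2f}")
```

In this simulated case the true traits are generated independently, so
any correlation among traits within one rater comes from the halo term
alone. With real ratings some part of that correlation would be
genuine.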

Of course, some relationship among desirable traits is to be ex- 
pected. We find correlation among different abilities when these are 
tested by objective tests and do not speak of the halo effect that 
produces a correlation between verbal and mechanical ability. Just 
how much of the relationship between the different qualities on which 
we get ratings is genuine and how much of it is spurious halo is very 
hard to determine. That some of the relationship is due to inability 
to free oneself from general biases seems clear, however, from
examples such as the one we have just given.


Reliability of Ratings. Studies have shown repeatedly that the 
between-raters reliability of the conventional rating procedure is low. 
Symonds, writing in 1931, summarized a number of studies and
concluded that the correlation between the ratings given by two 
independent raters for the conventional type of rating scale is about 
0.55. There seems to be no good reason to change this conclusion 
after the lapse of years. When the two ratings are uncontaminated,
i.e., the raters have not talked over the persons to be rated, and 
where the usual type of numerical or graphic rating is used, the re- 
sulting appraisal shows only this very limited consistency from rater 
to rater. 

If it is possible to pool the ratings of a number of independent 
raters who know the persons being rated equally well, reliability of 
the appraisal can be substantially increased. Studies have shown 18 
that pooling ratings functions in the same way as lengthening a test, 
and that the Spearman-Brown formula (p. 137) can legitimately be 
applied in estimating the reliability of pooled independent ratings. 
Thus, if the reliability of one rater is represented by a correlation
of 0.55, we have the following estimates for the reliability of pooled
ratings:

 2 raters    0.71
 3 raters    0.79
 5 raters    0.86
10 raters    0.92
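
The estimates above can be checked with a few lines of code. This is a
minimal sketch that assumes the standard form of the Spearman-Brown
formula, in which the reliability of k pooled raters is
k*r / (1 + (k - 1)*r) for a single-rater reliability r. The function
name spearman_brown is ours, chosen for illustration.

```python
def spearman_brown(r_single, k):
    """Estimated reliability of the pooled ratings of k independent raters.

    r_single is the correlation between two single raters; k is the
    number of equally qualified raters whose ratings are pooled.
    """
    return k * r_single / (1 + (k - 1) * r_single)


for k in (2, 3, 5, 10):
    print(f"{k:>2} raters  {spearman_brown(0.55, k):.2f}")
```

The gain from each added rater shrinks as k grows: going from one rater
to two raises the estimate from 0.55 to 0.71, whereas going from five to
ten raises it only from 0.86 to 0.92.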


Unfortunately, in many important practical situations it is
impossible to get additional equally qualified raters. An elementary-school
pupil has only one regular classroom teacher; a worker has only one
immediate supervisor. Adding on other raters who have limited
acquaintance with the ratee may weaken rather than strengthen the
ratings.


Reliability data on some of the newer types of rating devices to
be discussed presently appear somewhat more promising. These
data will be presented as the methods are discussed. One of the
gains from basing ratings on specific tangible behaviors will be, it is
hoped, that the objectivity, and hence the reliability, of the
judgments will be increased.

Validity of Ratings. All the limiting and distorting factors that we
have been considering make us doubtful about the validity of ratings.
Rater biases and rater unreliability operate to lower validity.
However, it is usually very difficult to make any statistical test of the
validity of ratings. The very fact that we have fallen back on
ratings usually means that no better measure of the quality in
question is available to us. There is usually nothing else against which
we can test the ratings.

In one context, the validity of ratings is axiomatic. If we are 
interested in appraising how people react to another person, e.g.,
whether a child is well liked by his classmates, ratings are the reac- 
tions of these other persons and are directly relevant to the point at 
issue. 

When ratings are being studied as predictors, statistical data can 
be obtained as to the accuracy with which they do in fact predict. 
This is something that must be determined in each setting and for 
each type of criterion that is being predicted. That ratings are in 
some cases the most valid available predictors is shown in recent 
studies of the ratings of aptitude for military service that are given 
at the U. S. Military Academy. These ratings by tactical officers
and by fellow cadets correlated more highly with later ratings of 
performance as an officer than did any other aspect of the man's 
record at West Point. Correlations with ratings of effectiveness in 
combat in Korea were about 0.50. This criterion is again a rating, 
but it is probably as close to the real “pay off" as we are likely to get. 
In other situations, of course, ratings may turn out to have no valid- 
ity at all. Each type of situation must be studied for its own sake. 


IMPROVING THE EFFECTIVENESS OF RATINGS 


So far we have painted a rather gloomy picture of rating
techniques as devices for appraising personality. The hazards and
pitfalls in rating procedures are many. But for all their limitations,
there are and will continue to be a host of situations in which we
will have to depend upon the judgments of other people as a
procedure for appraising our fellow men. The sincerity and integrity of
a potential medical student, the social acceptability of a would-be
salesman, the conscientiousness of a private secretary can probably
only be evaluated through the judgment that someone makes of
these qualities in the individuals in question. What can be done,
then, to mitigate the defects of rating procedures? We shall consider
first the design of the rating instrument and then the planning and
conduct of the ratings.


REFINEMENTS IN THE RATING INSTRUMENT 
The usual rating instrument has two main components: (1) a set 
of stimulus variables (the qualities to be rated) and (2) a pattern of
response options (the ratings that can be given). In the simplest
and most conventional rating forms, the stimulus variables consist 
of trait names and the response options consist of numerical or ad- 
jectival categories. Such a form was illustrated on p. 337. This 
type of format appears to encourage most of the shortcomings that 
we have been discussing in the preceding section. Consequently, 
many variations and refinements of format have been tried out in an 
attempt to overcome or at least minimize the shortcomings. The 
variations have manipulated the stimulus variables, the response
options, or both. Some of the main variations are described below. 


REFINEMENTS IN PRESENTING THE STIMULUS VARIABLES 


Bare trait names represent unsatisfactory stimuli for a rater for 
two reasons. In the first place, as we pointed out on p. 340, the 
words mean different things to different people. The child who 
shows "initiative" to teacher A may show "insubordination" to
teacher B, whereas teacher B's "good citizen" may seem to teacher
A a "docile conformist." In the second place, the terms are quite
abstract and far removed in many cases from the realm of observable
behavior. Consider "adjustment," for example. We do not observe
a child's adjustment. We observe a host of reactions to situations
and people. Some of these reactions are perhaps symptomatic of
poor adjustment. But the judgment about the child's adjustment
is a good many steps removed from what we have a chance to observe.

Workers with ratings have striven to get greater uniformity of
meaning in the traits to be rated, and they have attempted to base
the ratings more closely upon observable behavior. These attempts
have modified the stimulus aspect of rating instruments in three ways.

1. Trait Names Have Been Defined. A phrase, sentence, or several
sentences have been appended to each trait name to give it greater
uniformity of meaning. Thus, we might have:

Citizenship. Participation in school projects. Willingness to do his
share. Responsibility for work and property.

This represents a somewhat more objective and behavioral statement
and should produce at least some more uniformity in meaning
among a group of raters. However, we may doubt that a brief
verbal definition will completely overcome the individual differences
in meaning that different raters bring to the task.


2. Trait Names Have Been Replaced by Several More Concrete and
Limited Descriptive Phrases. Thus, the abstract and blanket term
"citizenship" might be broken down into the several components
suggested above, i.e.:

Participation in school projects. 

Willingness to do his share. 

Responsibility for completing work. 

Carefulness with school property. 
A judgment would now be called for with respect to each of the 
more limited and more concrete aspects of pupil behavior. 

3. Each Trait Name Has Been Replaced by a Substantial Number
of Descriptions of Specific Behaviors. This carries the move toward
concreteness and specificity one step farther. Following our analysis
of "citizenship," we might replace it with a set of behaviors somewhat
as follows:


a. Works well with other children in groups and committees. 
b. Brings materials to school. 

c. Does his work without complaining. 

d. Gets assigned work in on time. 

e. Keeps desk and work area neat. 

f. Uses materials without wasting. 

g. Works steadily, even when not watched. 

h. When one task is done, finds other work to do. 

i. Takes care of school property.


This list is still more tangible and specific. There should be rela- 
tively little opportunity, in each case, for ambiguity as to what it is 
that is being observed and reported on. 

The replacement of one general term with many specific behaviors 
gives promise of achieving more uniformity of meaning from one 
rater to another. It may also bring the ratings in closer touch with 
actual observations that have been made of the behavior of the in- 
dividual who is being appraised. Where the trait to be rated is one 
that the rater has really had no opportunity to observe, the attempt 
to replace the trait name with specific observable behaviors will often 
make this fact painfully apparent and will force the designer of the 
instrument to rethink the problem of relating his instrument to the 
observations that the rater has really had an opportunity to make. 

The gains that a specific list of behaviors achieves in uniformity 
of meaning and concreteness of behavior judged are not without
cost. The cost lies in the greatly increased length and complexity 
of the rating instrument. There are limits to the number of different 
judgments that can be asked of a rater. Furthermore, the lengthy, 
analytical report of behavior may be confusing to the person who 
tries to use and interpret it. The lengthy list of specific behaviors
will probably prove most effective when (1) judgments are in very
simple terms, such as simple present-absent and (2) there are provi- 
sions for organizing and summarizing the specific judgments into one 
or more scores for broad areas. 


VARIATIONS IN FORM OF RESPONSE CATEGORIES 


Expressing judgments about a ratee by selecting some one of a set 
of numbers, letters, or adjectives is still common on school report 
cards or in civil service and industrial merit rating systems. How- 
ever, these procedures have little other than simplicity to commend 
them. As we saw on p. 343, the categories are arbitrary and unde- 
fined. No two raters interpret them in exactly the same way. A 
rating of "superior" may be given to 5 per cent of employees by one
supervisor and to 25 per cent by another. One man's A is another
man's B. Subjective standards reign supreme. 

Various attempts have been made to manipulate the response options
to try to achieve a more meaningful scale or greater uniformity
from rater to rater.


1. Percentage of Group. To try to produce greater uniformity
from rater to rater and to produce greater discrimination among the
ratings given by a particular rater, judgments are sometimes called
for in terms of percentage of a particular defined group. Thus, the
professor rating an applicant for a fellowship is instructed to rate
each candidate according to the following scale:

Falls in the top 2 per cent of students at his level of training.
In top 10 per cent, but not in top 2 per cent.
In top 25 per cent, but not in top 10 per cent.
In top half, but not in top 25 per cent.
In lower half of students at his level of training.

Presumably, the specified percentages of a defined group provide a
uniform standard of quality for different raters. However, the stratagem
is usually only partially successful. Individual differences in
generosity are not that easily suppressed.
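
The five-step fellowship scale quoted above amounts to a set of cutoffs on a candidate's standing within the reference group. A minimal sketch in Python, assuming the rater can express that standing as a "per cent from the top" figure; the function name and this numeric input are illustrative assumptions of ours, not part of the original form:

```python
def percentage_group_category(percent_from_top):
    """Map a standing (0 = very top, 100 = very bottom) to a scale step."""
    if not 0 <= percent_from_top <= 100:
        raise ValueError("percent_from_top must be between 0 and 100")
    # Cutoffs follow the fellowship scale quoted in the text.
    if percent_from_top <= 2:
        return "Top 2 per cent"
    if percent_from_top <= 10:
        return "Top 10 per cent, but not top 2 per cent"
    if percent_from_top <= 25:
        return "Top 25 per cent, but not top 10 per cent"
    if percent_from_top <= 50:
        return "Top half, but not top 25 per cent"
    return "Lower half"
```

The sketch only formalizes the categories. It does nothing about the underlying difficulty noted above, namely that two raters may still disagree about where a given student stands.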

2. Graphic Scale. A second variation is more a matter of form
than clarity of definition. Rating scales are often prepared so that
judgments may be recorded as a check at some appropriate point
on a line, instead of by choosing a number, letter, or adjective. For
example:

Responsibility for      |________|________|________|________|
Completing Work         Very low         Average        Very high

The pattern often makes a fairly attractive page layout, is compact
and economical of space, and seems somewhat less forbidding than a
form which is all print. However, this particular variation does not
seem to have much advantage other than attractiveness and convenience.

3. Behavioral Statement. We have seen that the stimuli may be
in the form of relatively precise behavioral statements. Statements
of this sort may also be used to present the choice alternatives.
Thus, we may have an item of this type:

Participation in School Projects

|                 |                 |                 |                 |
Volunteers to bring       Works or brings materials     Does as little as possible.
in materials. Suggests    as requested. Participates,   Resists attempts to get
ideas. Often works        but takes no initiative.      him to help.
overtime.

In this case, three statements describing behavior are combined with
a graphic scale, and are used to define three points on the scale.
The descriptions may be expected to lend more concreteness and
uniformity of meaning to the scale steps. However, these editorial
provisions do not completely overcome rater idiosyncrasies, which
continue to plague us.

4. Man-to-Man Scales. An early attempt to get more uniformity
of meaning into the response scale, developed in World War I, used
men instead of numbers, adjectives, or descriptions to represent the
scale points. The rater is asked to think of someone he has known
well who was very high on the quality being rated. That person's
name is then entered on the rating form to define the "very high"
point on the scale. In the same way, the names of other persons
known well by the rater are entered in spaces to define "high,"
"average," "low," and "very low." The five names then define
levels for the trait. When a person is to be rated, the rater is
instructed to compare him with the five persons defining the levels on
the trait. The rater is to judge which man he most closely resembles
on the trait in question. He is assigned the value corresponding to
the step on the scale which that man occupies.

It was thought that the man-to-man feature would lend concreteness
to the comparisons and overcome the tendency of some raters
to be consistently generous. In cases in which all raters have a
wide range of acquaintance, so that their scale persons may be
expected to be fairly comparable, the procedure may make for more
uniformity from rater to rater. But such scope of acquaintance and
thoroughness of familiarity with suitable scale persons is likely to
be somewhat unusual in the practical situations in which ratings
must be made. Implicit comparison with other persons is involved
in any rating enterprise, but explicit use of particular persons to
define the steps on a rating scale has not been widely adopted.

5. Present-Absent. When a large number of specific behavioral
statements are used as the stimuli, the response that is called for is
often a mere checking of those that apply to the individual in question.
The person is then characterized by the statements that are
checked as representing him. The rating scale becomes a behavior
check list. The set of items on p. 349 might constitute part of such
a check list.

If this type of appraisal procedure is to yield a score, the statements
must be scaled or assigned score values in some way. The
simplest way is merely to score them +1, -1, or 0, depending upon
whether they are favorable, unfavorable, or neutral with respect to
a particular attribute (i.e., perseverance, integrity, reliability, etc.)
or a particular criterion (i.e., success in academic work, success on a
job, responsiveness to therapy, etc.). An individual's score can then
be the sum of the scores for the items he checks.
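
The simple +1, -1, 0 scoring just described can be sketched in a few lines of Python. The scoring key below is hypothetical and for illustration only; the statements and their values are our assumptions, not taken from any published check list:

```python
def checklist_score(checked_items, item_keys):
    """Sum the +1 / -1 / 0 keys of the statements checked for one person."""
    return sum(item_keys[item] for item in checked_items)

# Hypothetical scoring key: +1 favorable, -1 unfavorable, 0 neutral.
ITEM_KEYS = {
    "Gets assigned work in on time": 1,
    "Works steadily, even when not watched": 1,
    "Wastes materials": -1,
    "Sits near the window": 0,
}

# One favorable, one unfavorable, and one neutral statement were checked.
score = checklist_score(
    ["Gets assigned work in on time", "Wastes materials", "Sits near the window"],
    ITEM_KEYS,
)
```

Statements that are not checked contribute nothing, so the score depends only on the items the rater marks as describing the person.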

If the additional elegance seems justified, more refined scaling procedures
can be applied to the statements. Scale values can be based
on their judged significance or the degree to which they had actually
discriminated between successful and unsuccessful individuals. The
score an individual receives is then based on an averaging of the
scale values of the items that were checked as describing him. The
reliability of such a check list of scaled items has been found to be
quite satisfactory in some instances, Richardson and Kuder reporting
a correlation of 0.83 between two independent raters of
groups of salesmen.

Only limited use has been made of check lists as devices to yield
scores on each individual, but they seem to present a promising pattern.
One well-known instrument that follows essentially this pattern
is the Vineland Social Maturity Scale. This check list is made
up of items relating to self-help, self-direction, communication,
socialization, and the like. Selected items from different levels of
the scale are shown in Table 13.1.


Norms for the scale were established for each item, representing
the age at which the behavior appears on the average. The check
list is filled out by a rater who knows the child being appraised.
Items the person does or can do are checked. A basal age is established
for which all items are positive, and the person being rated is
automatically given credit for all earlier items. Points are given for
additional items passed. The table of norms gives developmental
age equivalents for the point scores, and a developmental quotient
may be computed that indicates the individual's rate of progress
toward self-sufficiency and independence.


Table 13.1. Items Selected from the Vineland Social Maturity Scale

Item No.   Age Level    Item
           (in years)

    1        0-1        "Crows," laughs
    6        0-1        Reaches for nearby objects
   11        0-1        Drinks from cup assisted
   15        0-1        Stands alone
   19        1-2        Marks with pencil or crayon
   28        1-2        Eats with spoon
   34        1-2        Talks in short sentences
   37        2-3        Removes coat or dress
   40        2-3        Dries own hands
   44        2-3        Relates experiences
   51        4-5        Cares for self at toilet
   53        4-5        Goes about neighborhood unattended
   68        7-8        Disavows literal Santa Claus
   70        7-8        Combs or brushes hair
   78       10-11       Writes occasional short letters
   80       10-11       Does small remunerative work


The check-list pattern has been used as a simple descriptive in- 
strument, as in school reports to the home. The procedure is attrac- 
tive in this setting because it can give information on specific aspects 
of pupil development. However, forms tend to become complicated 
and to confuse many parents, so this type of reporting has not been 
widely adopted. 

6. Frequency of Occurrence, or Typicality. Instead of reacting in
an all-or-none fashion to an item, as in the check list, response can
be qualified as being "always," "usually," "sometimes," "seldom,"
or "never" characteristic of the ratee. Or the ratee may be characterized
as "very much like," "a good deal like," "somewhat like,"
"slightly like," or "not at all like" the behavior described in the
statement. (The terms of frequency or resemblance may vary; the
ones given are only suggestive.) An individual's score would now
take account both of the significance of the statement and the point
on the scale that was checked. That is, an important attribute
would receive heavier credit than a minor one, and a check at the
"always" step more credit than a check at "usually."
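
One way to combine the two ingredients just named, the significance of a statement and the frequency step checked, is to multiply a weight by a step credit and add over statements. The text does not prescribe a formula, so the sketch below is only one plausible reading; the step credits (4 down to 0) and the statement weights are assumptions chosen for illustration:

```python
# Assumed credits for the five frequency steps named in the text.
FREQUENCY_CREDIT = {"always": 4, "usually": 3, "sometimes": 2, "seldom": 1, "never": 0}

def frequency_weighted_score(responses, weights):
    """Add up (statement weight x frequency credit) over all statements."""
    total = 0
    for statement, frequency in responses.items():
        total += weights[statement] * FREQUENCY_CREDIT[frequency]
    return total

# Hypothetical weights: the first statement is treated as more important.
WEIGHTS = {"Gets assigned work in on time": 2, "Keeps desk and work area neat": 1}
RESPONSES = {
    "Gets assigned work in on time": "always",
    "Keeps desk and work area neat": "usually",
}
total_score = frequency_weighted_score(RESPONSES, WEIGHTS)  # 2*4 + 1*3
```

With these assumed values, a check at "always" on the heavier statement contributes 8 points and a check at "usually" on the lighter one contributes 3.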

Indefinite designations of frequency or degree of the sort that are
being discussed here will be differently interpreted by different 
raters, so the old problem of differences in rater standards is still 
with us. However, there is one technique that should largely elim- 
inate such differences in rater standards. One could include in the 
list of behaviors to be checked a varied assortment of statements 
that sound desirable but are unrelated to the trait that one wishes 
to evaluate or the criterion one wants to predict. For example, on 
a fellowship application blank one might include, along with “shows 
originality in attacking problems” and “has good command of his 
major field,” items such as “work is always neat and orderly" and 
"is well-liked by fellow students." One could use the average rating 
on these non-significant items as a base-line for appraising the rating 
on the significant statements. This base-line might be expected to
reflect both the rater's interpretation of the scale categories and his 
general bias with respect to the person being rated. By using the 
non-significant statements as a base-line, it should be possible to 
wash out both differences in rater standards and bias with respect 
to the ratee. The preparation of a rating instrument that will ac- 
complish this represents a rather complex and exacting psychometric 
undertaking. 
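
The base-line idea can be illustrated with a small Python sketch. The text says only that the non-significant items serve as a base-line; treating the adjusted score as a simple difference of averages is our assumption, and the item names and ratings below are invented for the example:

```python
from statistics import mean

def baseline_adjusted_rating(ratings, significant_items, baseline_items):
    """Average rating on significant items minus average on base-line items."""
    baseline = mean(ratings[item] for item in baseline_items)
    return mean(ratings[item] for item in significant_items) - baseline

SIGNIFICANT = ["shows originality", "commands major field"]
BASELINE = ["work is neat", "is well-liked"]

# Two hypothetical raters describing the same candidate on a 1-5 scale:
# one is generous throughout, the other severe throughout.
generous = {"shows originality": 5, "commands major field": 5,
            "work is neat": 4, "is well-liked": 4}
severe = {"shows originality": 3, "commands major field": 3,
          "work is neat": 2, "is well-liked": 2}

adj_generous = baseline_adjusted_rating(generous, SIGNIFICANT, BASELINE)
adj_severe = baseline_adjusted_rating(severe, SIGNIFICANT, BASELINE)
```

In this contrived case the two raters differ by two full steps on every item, yet their adjusted ratings agree, which is the effect the technique is meant to produce.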

7. Ranking. In those cases in which each rater knows a substantial
number of ratees, he may be asked to place them in rank order
with respect to each attribute being studied. Thus, a teacher may
be asked to indicate the child who is most outstanding for contributing
to the class projects and activities "over and beyond the call
of duty," the one who is second, and so on. Usually, the ranker will
be instructed to start at both ends and work in toward the middle,
since the extreme cases are usually easier to discriminate than the
large group of average ones in the middle. In order to ease the task
of the ranker, tie ranks may be permitted. If no tie ranks are permitted,
the ranker may feel that the task is an unreasonable one,
especially in a group of some size.

Ranking is an arduous task for the ranker, but it does achieve
two important objectives. It forces the person doing the evaluation
to make discriminations among those being evaluated. The ranker
cannot place all or most of the persons being judged in a single category,
as may happen with other reporting systems. Secondly, it
washes out individual differences among raters in generosity or
leniency. No matter how kindly the ranker may feel, he must put
somebody last, and no matter how hardboiled he is, someone must
come first. Individual differences in standards of judgment are
eliminated from the final score.

If scores based on rankings by different judges are to be com- 
bined, there is one assumption that is introduced in rankings that 
may be about as troublesome as the individual differences in judging 
standards that have been eliminated. If we are to treat rankings by 
different judges as comparable scores, we must assume that the 
quality of the group ranked by each was the same. That is, we 
assume that being second in a group of 20 represents the same level 
on the trait being appraised, whichever group of 20 it happened to 
be. Usually we do not have any direct way of comparing the dif- 
ferent subgroups, so about all we can do is assume that they are 
comparable. If the groups are fairly sizable and chosen more or less 
at random from the same sort of population, this may be a reason- 
able assumption. But with small groups or groups selected in dif- 
ferent ways, the assumption of comparability may introduce sub- 
stantial amounts of error into our scores based on ranks. 

Ranks as such do not represent a very useful score scale. The
meaning depends upon the size of the group: being third in a group
of three is very different from being third in a group of 30. Furthermore,
steps of rank do not represent equal units of a trait. As we
saw in our discussion of percentile norms (Chapter 7), in the usual
bell-shaped distribution, one or two ranks at the extremes of a group
represent much more of a difference than the same number of ranks
near the middle of the group. For that reason, it is common practice
to convert ranks into normalized standard scores in order to get a
type of score that has uniform meaning without regard to the size
of the group and uniform units throughout the score range. Special
tables have been prepared to facilitate this conversion, and tables
for groups of all sizes up to 25 may be found in Symonds (ref. 17,
pp. 90-92).
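
The conversion of ranks to normalized standard scores can also be computed directly. The sketch below uses one common convention, which we assume here and which may differ in detail from the published tables: each rank is placed at the midpoint of its share of the group, that proportion is converted to a normal deviate, and the deviate is expressed on a scale with mean 50 and standard deviation 10.

```python
from statistics import NormalDist

def rank_to_standard_score(rank, group_size, mean=50.0, sd=10.0):
    """Convert a rank (1 = highest) to a normalized standard score."""
    if not 1 <= rank <= group_size:
        raise ValueError("rank must lie between 1 and group_size")
    # Proportion of the group falling below the midpoint of this rank.
    proportion_below = (group_size - rank + 0.5) / group_size
    # Normal deviate cutting off that proportion of the distribution.
    z = NormalDist().inv_cdf(proportion_below)
    return mean + sd * z

third_of_three = rank_to_standard_score(3, 3)     # last in a small group
third_of_thirty = rank_to_standard_score(3, 30)   # near the top of a large group
```

Under this convention, third place in a group of three falls below the mean of 50 while third place in a group of 30 falls well above it, which is the distinction the raw rank fails to show.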


NOMINATING TECHNIQUES 
If a teacher is to understand pupils, he must have some awareness
of the values and standards that the group sets for its members
(the peer culture) and of the role that each child plays in the
group of his contemporaries (the peer group). The standards and
values of his peers provide the sanctions and the rewards that are
very influential in determining how a person will act and how content
he will be in the group setting. The peer group can be quite a
cohesive unit. In such a group any action by a teacher with respect
to an individual child is often viewed not only as an action for or
against him but also as an action for or against the group to which
he belongs and which identifies with him. Thus, in order both to
understand the individual and to understand how acts with respect
to individuals affect the group climate, it is important to appraise
the role of the individual in the group.

It is far from easy for the teacher or other outsider to get an accurate
appraisal of group structure and of the place of the individual
in it. The child's role is likely to be seen only from an adult point
of view and that adult viewpoint to be projected upon the group of
his contemporaries. Thus, when a child is helpful, friendly, and
generally acceptable to the teacher, the teacher is likely to attribute
to that child a level of influence with other children that he does not
have. It is often difficult for the teacher to attribute to an active
and troublesome child his true level of influence with his peers.
Teachers are often only dimly aware of the pattern of social interplay
in their classroom, the reputation of each pupil among his peers,
the factors determining prestige in the peer group, the patterns of
attraction and repulsion, or the individual social aspirations.

In the understanding of these relationships, peer ratings are often
helpful. A rating procedure that is very simple and quite effective
for obtaining appraisals by peers is the nominating technique. We
will consider this technique first as applied to social choices and
rejections and then as applied more generally to trait ratings.

To improve their understanding of the social structure in a classroom,
the patterns of friendship and leadership, teachers may use
the simple expedient of asking pupils to name their choices of best
friends or of work partners. For example, a teacher might say to a
class: "For our unit on Mexico, we are going to need some committees
of children who will work together on some part of the project.
I would like to know which children you would like to have on a
committee with you. Put your name on the top of the piece of
paper I gave you. Then under it put the names of the children you
would especially like to have on your committee."

We now have a series of nominations or choices for work partners.
It is possible to show these choices pictorially by a diagram such
as that shown in Fig. 13.1. This is called a sociogram and the
procedure of constructing a sociogram is called sociometry.

Procedures to help in the construction of sociograms can be found
in Moreno and in a booklet by the staff of the Horace Mann-Lincoln
Institute.


From the sociogram shown in Fig. 13.1, we see that A and B are
the most sought after members of the group: these are the "stars."
Pupils J and O did not choose anyone and were not chosen by any
other pupils: they are isolates. Pupils H and I chose each other but
were not chosen by any other pupils. Except for the mutual friendship
between them, they too are isolates. Pupils P, Q, M, and N
are fringers: they do not really belong to any of the groups but do
make choices within the group.
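
Before a sociogram is drawn, the nominations can be tallied. The Python sketch below is a rough illustration using an invented four-pupil class, not the class of Fig. 13.1. It counts choices received, lists mutual choices, and separates pupils who neither chose nor were chosen from those who chose others but were not chosen themselves; the second group is only our approximation to the "fringers" described above.

```python
from collections import Counter

def summarize_choices(choices):
    """choices maps each pupil to the set of classmates he or she named."""
    # Start every pupil at zero so that unchosen pupils appear in the tally.
    received = Counter({pupil: 0 for pupil in choices})
    for chooser, chosen in choices.items():
        for name in chosen:
            received[name] += 1
    # A mutual choice exists when each of two pupils names the other.
    mutual = {
        frozenset((a, b))
        for a, chosen in choices.items()
        for b in chosen
        if a in choices.get(b, set())
    }
    isolates = {p for p in choices if received[p] == 0 and not choices[p]}
    unchosen = {p for p in choices if received[p] == 0 and choices[p]}
    return received, mutual, isolates, unchosen

# Hypothetical nominations for a very small class.
CHOICES = {"A": {"B"}, "B": {"A"}, "C": {"A", "B"}, "D": set()}
received, mutual, isolates, unchosen = summarize_choices(CHOICES)
```

Pupils with the highest counts in `received` correspond to the "stars"; how many choices make a star is a matter of judgment that the tally does not settle.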


[Sociogram diagram not reproduced. Legend: single arrow = choice; double-headed arrow = mutual choice.]

Fig. 13.1. Sociogram of fourth-grade class.


Figure 13.1 shows the pattern of choices and attractions within
the group. It would also be possible to have children indicate those
class members whom they would definitely not want in their group.
Calling for rejections presents some slight risks to individual and
class morale but does permit a more complete picture of group
structure.
The sociogram in Fig. 13.1 indicates that this is not a closely-knit
group. The rather large number of isolates and fringers and the
linkages across from one "clique" to the other suggest an unstable
pattern which is in the process of changing and reforming. Thus,
the sociogram might represent a class at the beginning of the school
year, in which a residue of last year's friendships is mixed with new
currents and in which pupils from other class groups and other
schools are not yet integrated into the group. It is in such a setting
as this that the teacher can be most effective in bringing isolates
into the group or promoting new friendships.

After the teacher has determined which children are without 
friends or are relatively isolated in the group, he should try to find 
out why this is the case. Sometimes the explanation may be very 
simple. The child may be new to the group and have not yet had 
time to find his place in it. The normal opportunities to get ac- 
quainted, furthered by the teacher's efforts to bring out the new 
child's assets, may be all that is required. The child may be older 
or younger than the rest of the group, having friends in other classes 
or outside of school. The child may not live near any of the other
children in the class. At other times, the reasons may be more
subtle, and it may take a good deal of discreet sleuthing for the
teacher to find out why Willie or Alice is not chosen by classmates.

When the reasons are understood, the teacher can often help to
remove them. Sometimes the simple process of coaching the child 
so that he develops competence in athletics may turn the trick. 
The teacher can arrange seats so that a child is placed near one for 
whom he expressed preference. Sometimes helping a child to de- 
velop everyday social graces or to improve his personal appearance 
is all that is needed to make him acceptable. If an isolate or fringer 
has special mechanical or artistic skills, giving him an opportunity 
to use these in class group activities may be effective.

In general, the teacher can help a child become integrated with
and accepted by his peer group by (1) providing opportunity for
developing friendly relations, (2) improving social skills, and (3)
building up a sense of accomplishment or competence.

Sociometric choices describe the present flow of interaction among
children rather than indicating any strong and permanent emotional
structuring. However, the structuring of a class group affects
the general emotional climate of the classroom. In a class where
there are many isolates or children who are "fringers," i.e., not
completely accepted by a clique, the morale of the group tends to be
low and group planning and coordinated group action is made more
difficult. It is also true that the teacher in dealing with one child
is quite frequently dealing with the clique to which the child
belongs.

Sociograms frequently point up mistakes that a teacher makes in
characterizing a child. Thus, when the teacher has judged a child
and his position in his peer group by adult standards, sociometric
devices point out these mistakes and give the teacher a framework
for understanding behavior that taken by itself may seem
unexplainable.

Sociograms have been used in various non-school situations. In 
industry they have been used to form work groups and have been
found to stimulate production. They have been used in institutions,
especially those for juvenile offenders, to select house groups.

The sociogram by itself tells the teacher only what children are 
selected or rejected, not the reasons for selection and rejection. It 
is most useful when used in conjunction with good anecdotal records. 
For successful use, especially when rejections are asked for, there 
needs to be a friendly feeling between the teacher and the class. 
Furthermore, the teacher should actually use the nominations as far 
as possible in the way in which he has told the class he would use 
them.

The teacher should also remember that group structure is not
static, especially in younger age groups. One sociogram made at the
beginning of a school year will rarely provide an adequate picture of
group structure through the year. Furthermore, neither choices nor
rejections can be taken entirely at face value. When, as is sometimes
the procedure, the number of choices is limited to "three best friends,"
failure to choose a particular pupil need not mean lack of friendly
feeling for him. Choices may reflect the prestige of the person
chosen and a desire to be associated with that prestige, rather than
a link of friendship. The culture pattern in certain age groups
dictates that rejections follow sex lines. Class and caste distinctions
also introduce cultural factors influencing choice and rejection. A
sociogram is at best a rough and tentative picture of the social
currents and climate of the group.

A final word of caution should be sounded about attempting to
use sociometric data to reconstruct a group or modify a child's role
in it. We have offered some suggestions as to ways in which a teacher
may try to help the relatively isolated child. However, any such
manipulations call for a good deal of subtlety. Heavy-handed attempts
by the teacher to manipulate the pupils in the group may
only aggravate the ills he is trying to cure.

Nominating techniques have been used for various other purposes
besides the preparing of sociograms and the studying of social currents
within the group. They provide a simple procedure for obtaining
ratings by a group of peers, and their simplicity makes them
usable even with elementary school children. Nominations may be
made with respect to any type of characteristic. For example, each
member of a unit in Officer Candidate School may be asked to nominate
the two individuals in his unit who have shown the most evidence
of "leadership" during the training course. He may also be
asked to nominate two who have shown the least indication of
leadership. Taking all the nominations for the group as a whole, it
is possible to arrive at a score for each individual, giving a plus for
each favorable nomination and a minus for each unfavorable nomination.

A variation of the nominating procedure that has been used with
school children has usually been referred to as the "Guess Who"
technique or as "Casting Characters." In this procedure, the children
are instructed somewhat as follows:

Suppose we were going to put on a class play. The characters in the
play are described below. For each character, you are to put down the
names of one or more children in the class who would be good for that
part because he or she is just like that anyway.

"This person is always cheerful and happy, never grouchy or cross."
"This person is always butting in and telling other people how to do
things. He cannot mind his own business."
"This person is very quiet and doesn't get into games or do things with
other children."

The number of characters can be extended as desired. Each
"character" is a description in fairly concrete terms of a quality of
behavior in which the investigator is interested. Descriptions of
opposite ends of a scale can be included (i.e., friendly versus unfriendly,
dominating versus submissive, etc.) and can be treated as
positive and negative nominations on a single scale. Each child
receives a score for each character, based on the number of nominations
he receives.

The attractive feature of the nominating pattern is its simplicity,
which makes it rather painless to administer and usable with young
groups or groups with little sophistication or experience in rating.
It is feasible because the large number of raters makes it possible to
use a simple count of nominations instead of a rating of the usual type.
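
Scoring nominations on a single scale with two opposite "characters" can be sketched as follows in Python. The character labels, pupil names, and nomination lists are invented for illustration; the rule of counting +1 for each nomination at the favorable end and -1 for each at the unfavorable end follows the plus-and-minus scoring described for nominations in the text.

```python
from collections import Counter

def guess_who_scores(nominations, positive, negative):
    """nominations maps a character label to the list of pupils named for it."""
    scores = Counter()
    # Each nomination for the favorable character adds one point.
    for pupil in nominations.get(positive, []):
        scores[pupil] += 1
    # Each nomination for the opposite character subtracts one point.
    for pupil in nominations.get(negative, []):
        scores[pupil] -= 1
    return scores

# Hypothetical nominations pooled over all the children in a class.
NOMINATIONS = {
    "friendly": ["Ann", "Ann", "Bob"],
    "unfriendly": ["Bob", "Bob", "Cal"],
}
friendliness = guess_who_scores(NOMINATIONS, "friendly", "unfriendly")
```

A child who is never named for either character has no entry and is read as zero, which leaves open whether he is truly average or simply unnoticed.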
THE "FORCED-CHOICE" PATTERN 


All the variations that we have considered so far operated on the
same basic pattern. The rater considered one attribute at a time
and assigned the ratee to one of a set of categories or placed him
relative to others on that particular attribute. We shall now consider
a major departure from that pattern. The essence of the procedure
we consider now is that the rater considers a set of attributes
at one time and decides which one (or ones) most accurately represents
the person being rated. Thus, an instrument developed for
evaluating Air Force technical-school instructors included sets of
items such as the following:


a. Patient with slow learners. 

b. Lectures with confidence. 

c. Keeps interest and attention of class. 

d. Acquaints classes with objective for each lesson. 


The rater's assignment was to pick out the two items from the set 
that were most descriptive of the person being rated. 

Note that all the statements in the above set are nice things to 
say about an instructor. As a matter of fact, they were carefully 
matched, on the basis of information from preliminary investiga- 
tion, to be just equally nice to say about an instructor. But they 
differ a good deal, again based on preliminary investigations, in the 
extent to which they actually distinguish between persons who have 
been identified on other evidence as being good and poor instructors. 
The most discriminating statement is (a) and the least discriminat- 
ing is (b). Thus, we could assign a score value of 2 to statement 
(a), 1 to (c) and (d), and 0 to (b). A person's score for the set would 
be the sum of the credits for the two items marked as most descrip- 
tive of him. His score for the whole instrument is the sum of his 
scores for 25 or 30 such blocks of four statements. Such a score was
found to have good split-half reliability (0.85 to 0.90), so that the
instrument provides a reliable score for the individual's desirability
as an instructor in the eyes of a single rater. This does not, of
course, tell anything about the agreement that would be found
between different raters.

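
The scoring rule just described can be illustrated with a short sketch. The score values for the block (2 for statement a, 1 for c and d, 0 for b) come from the example above; the three-block form, the reuse of one key for every block, and the function names are hypothetical and serve only to show the arithmetic.

```python
# Score values for one block of four statements, as in the example above:
# (a) = 2, (c) and (d) = 1, (b) = 0.
BLOCK_KEY = {"a": 2, "b": 0, "c": 1, "d": 1}


def score_block(chosen, key=BLOCK_KEY):
    """Sum the credits for the two statements marked most descriptive."""
    if len(chosen) != 2 or len(set(chosen)) != 2:
        raise ValueError("the rater must pick exactly two different statements")
    return sum(key[item] for item in chosen)


def score_instrument(choices_per_block, keys):
    """Total score: the sum of the block scores over all blocks."""
    if len(choices_per_block) != len(keys):
        raise ValueError("one pair of choices is needed for every block")
    return sum(score_block(chosen, key)
               for chosen, key in zip(choices_per_block, keys))


# Hypothetical three-block form that reuses the same key for every block.
keys = [BLOCK_KEY] * 3
choices = [("a", "c"), ("b", "d"), ("a", "d")]
total = score_instrument(choices, keys)
print(total)
```

In an actual instrument each block would have its own key, derived from the preliminary investigation, and the rater would not be shown the score values.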
By casting the evaluation instrument into forced-choice sets, the
makers hope to accomplish three things:

1. They hope to eliminate variation in rater standards of
generosity or kindliness. Since the items in a set are all equally
favorable things to say about a person, the kindly soul should have no
particular tendency to choose one rather than another, and the true
nature of the ratee should be the controlling factor.

2. They hope to minimize the possibility of a rater intentionally
biasing the score. In the ordinary rating scale, the rater is in pretty
complete control of the situation. He can rate a man up or down as
he pleases. In the forced-choice type of instrument, it is hoped that
the rater will be unable to identify which are the significant choices
and that therefore he will be unable to throw the score one way or
the other at will. There are some indications that a forced-choice
instrument is less fakeable than an ordinary rating scale, but it is
still far from tamper-proof in the hands of a determined rater.

3. They hope to produce a better spread of scores and a more 
nearly normal distribution of ratings. By making all options equally 
attractive, one minimizes the effect of the generosity error, it is hoped, 
and gets a more symmetrical spread of scores. Again, there is indi- 
cation that this result is achieved at least in part. 


Forced-choice rating instruments are a relatively new develop- 
ment, dating from World War II, though the forced selection of one 
of a set of alternates had been used before that time in self-report 
inventories. Because of the relative novelty of the forced-choice 
pattern, evaluation of its usefulness in merit rating procedures and 
in personality appraisal is still incomplete. This format does appear 
to get away from some of the most troublesome limitations of con- 
ventional rating procedures. However, it has some limitations of its 
own. It has a tendency to create rater resistance, because of the 
difficulty of the judgments that the rater is called upon to make. 
Where the options are negative, i.e., “Is this worker more stupid or 
more lazy?,” the instrument has a good deal of the “Have you 
stopped beating your wife yet?” flavor. And even the judgment as 
to whether employee A is more intelligent or more industrious is not 
easy to make. There often seems to be no basis for comparing 
two quite different traits. The score that results from this type of 
instrument does not have any clear label or interpretation, even if 
it is a relatively good predictor of some particular criterion. It gives
us little help in building a descriptive picture and an understanding
of the individual.


Further research is needed to tell whether there are unique
advantages in the forced-choice pattern or whether the same control on
rater variations and idiosyncrasies could be obtained by including
favorable but non-significant items as control variables in the more
conventional rating pattern. The forced-choice pattern is a
stimulating new development that may have value both for itself and for
the rethinking that it will call forth in other rating techniques.

REFINEMENTS IN THE RATING PROCEDURES 


The best-designed instrument cannot give good results if used
under unsatisfactory rating conditions. Raters cannot give
information they do not have and will not give information they are
unwilling to give. We must, therefore, try to pick raters who have had
close contacts with the ratees and ask them for judgments on
attributes they have had an opportunity to observe. We should give
them some guidance and training in the type of judgments we expect
them to make, and if possible they should have opportunity to
observe the ratees after they have been educated in the use of the
ratings. When there are several people who know the ratees equally
well, ratings should be gathered from all of them and pooled. Let
us consider these points further.

Selection of Raters. For most purposes, the ideal rater is the per- 
son who has had a great deal of opportunity to observe the person 
being rated in situations in which he would be likely to show the 
qualities on which ratings are desired. (Occasionally it may be de- 
sirable to get a rating of the impression which a person makes on 
brief contact or in a limited experimental situation.) It is also de- 
sirable that the rater take an impartial attitude toward the ratee. 
The desirability of these two qualities, thorough acquaintance and 
impartiality, is generally recognized in the abstract. However, the 
goals may be only partially realized in practice. 

Administrative considerations usually dictate that the rating and 
evaluation function be assigned to the teacher in the school setting 
and to the supervisor in a work setting. The relationship here is in
each case one of direct supervision. There is generally a continuing
and fairly close personal relationship. But the relationship is a one- 
directional and partial one. The teacher or supervisor sees only one 
side of the pupil or worker, the side that is turned toward the "boss." 

Those qualities that a boss has a good chance to see, primarily
qualities of work performance, can probably be rated adequately by
the teacher or supervisor. Thus, in one study⁷ of airplane mechanics
it was found that the ratings by a pair of supervisors on "job
know-how" were as reliable as the pooled ratings by eight coworkers in a
plane maintenance crew and that the supervisors' pooled rating
correlated 0.53 with a written proficiency test, whereas the pooled
rating for the coworkers correlated only 0.43. However, those qualities
that show themselves primarily in relationships with peers or
subordinates will probably be evaluated more soundly by those same
peers and subordinates. The validity of the U. S. Military Academy
peer ratings described on p. 347 is a case in point.

The lack of agreement between supervisor and pupil ratings of
teachers is suggested in some of the following correlations:


Pupil's rating of excellence versus principal's rating                  . 938
Pupil's rating of excellence versus composite of 5 judges               - 15
Mean pupil rating of effectiveness versus administrator's rating        0.08
Student versus administrator rating on general teacher effectiveness
    School I                                                            [illegible]
    School II                                                           0.50




A certain amount of overlap does exist, but the ratings appear also 
to have a good deal of uniqueness. The bird’s eye and worm's eye 
views are not the same. 

Who Should Choose the Raters? The selection of persons to rate 
applicants for jobs or fellowships requires consideration from another 
point of view. In this setting, the applicant is usually asked to 
supply a certain number of references or to submit evaluation forms 
filled out by a certain number of individuals. The choice of the 
individuals is usually left up to him, and we may anticipate that he 
will select persons he believes will rate him favorably. It might be 
more satisfactory if the applicant were asked to supply the names 
and addresses of persons who stood in particular relationships to 
him and who should be able to supply relevant information, rather 
than leaving the applicant free to pick his own endorsers. Thus, a 
job applicant might be asked to give the names of his immediate 
supervisors in his most recent jobs; a fellowship applicant, to list 
the name of his major advisor and of any instructors with whom he 
had taken two or more courses. Thus, we are shifting the respon- 
sibility of determining who shall provide the ratings from the appli- 
cant to the using agency. Such a shift should reduce the amount
of special pleading for the applicant.

Selection of Qualities to Be Rated. Two principles appear to apply
in determining the types of information to be sought by rating
procedures. In the first place, it seems undesirable to use rating
procedures to get information that can be satisfactorily provided by
some more objective and reliable indicator. Score on a well-constructed
intelligence test is a better indicator of intellectual ability
than some supervisor's rating of intellect. Production records, if
satisfactory ones exist, are to be preferred to a supervisor's rating
for productivity. Ratings are something to which we resort when
we do not have any better indicator to fall back on.

Secondly, we should limit ratings to relatively overt qualities, ones
that can be expressed in terms of actual observable behavior. We
cannot expect the rater to look inside the ratee and tell us what goes
on within. Furthermore, we must bear in mind the extent and nature
of the contact between rater and person rated. For example, a
set of ratings to be used after a single interview should be limited to
the qualities that can be observed in an interview. The interviewee's
neatness, composure, manner of speech, and fluency in answering
questions are qualities that are observable in a single interview. His
industry, integrity, initiative, and ingenuity are not, though these
qualities might be appraised with some accuracy by the person who
has worked with him for a time. Ratings should be of observable
behavior, observable in the setting in which the man has been
observed.

Educational Program for Raters. Good ratings do not just happen,
even with the proper raters and the proper instrument for recording
the ratings. Raters must be "sold" on the importance of making
good ratings and taught how to use the rating instrument. Pointing
out the importance of "selling" a rating program is easier than
telling how to do it. As we have indicated earlier, inertia on the one
hand and identification with the ratee on the other are powerful com- 
peting motives. We cannot provide a course in direct selling at this 
point, but a job of selling needs to be done in almost any program 
of ratings. Furthermore, the selling must continue if thoughtfulness 
and integrity of the appraisals are to be maintained. 

It is desirable that raters have practice with the specific rating 
instrument. A training session, in which the instrument is used 
under supervision, is often desirable. The meanings of the attributes 
can be discussed, sample rating sheets can be prepared, and the
resulting ratings reviewed. The prevailing generosity error can be
noted, and raters cautioned to avoid it. Further practice can be
given, in an attempt to provide a more symmetrical distribution of
ratings. Training sessions will not eliminate all the shortcomings of
ratings, but they should reduce somewhat the more common distor- 
tions considered earlier. 

Observations Made as a Basis for Ratings. One objection to ratings 
is that they are usually made after the fact and are based on general 
unanalyzed impressions about the person rated. An attempt to get 
away from this dependence on general memory is sometimes made 
by introducing the rating program well in advance of the time at 
which the final ratings are to be called for. It is hoped that the raters 
will then be on the alert for and take specific note of behavior relat- 
ing to the qualities that are to be rated. As noted on p. 339, the 
attempt has even been made to provide for systematic recording of 
such observations over a period of time. However, recording of this 
type calls for a high level of commitment to, and cooperation in, 
the rating program. Where that level of involvement is achieved, 
advance notice and systematic recording may be expected to improve 
the rating process. Situations of this sort are probably rare, however. 

Pooling of Ratings by Several Raters. One of the limitations of
ratings is low reliability. In those situations in which there are a
number of persons who have all had approximately equal chance to
observe the ratee, it may be possible to get independent ratings from
each potential rater and to pool these into a composite rating.
Studies have shown that the effect on reliability of pooling inde- 
pendent ratings is essentially the same as the effect of lengthening a 
test. The formula given in Chapter 6 (p. 137) applies. Thus, the- 
oretically we could achieve any needed level of reliability in our ap- 
praisal merely by increasing the number of raters. 
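
The effect of adding raters can be worked out numerically. The sketch below assumes that the formula referred to is the Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), where r is the reliability of a single rater and k the number of raters pooled; this is an inference from the analogy with lengthening a test. The single-rater reliabilities used are invented for illustration.

```python
import math


def pooled_reliability(single_rater_r, n_raters):
    """Spearman-Brown estimate for the pooled rating of n_raters
    equally qualified, independent raters."""
    if not 0 < single_rater_r < 1:
        raise ValueError("single-rater reliability must lie between 0 and 1")
    if n_raters < 1:
        raise ValueError("at least one rater is needed")
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


def raters_needed(single_rater_r, target_r):
    """Smallest number of raters whose pooled rating reaches target_r."""
    if not (0 < single_rater_r < 1 and 0 < target_r < 1):
        raise ValueError("reliabilities must lie between 0 and 1")
    k = target_r * (1 - single_rater_r) / (single_rater_r * (1 - target_r))
    # The small tolerance guards against floating-point error when k
    # works out to a whole number.
    return max(1, math.ceil(k - 1e-9))


# Hypothetical single-rater reliabilities.
print(pooled_reliability(0.50, 3))
print(raters_needed(0.40, 0.85))
```

The estimate holds only to the extent that the raters are interchangeable, which is exactly the condition questioned in the next paragraph.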

The catch is found in the phrase “equal chance to observe the 
ratee." Unfortunately, the number of persons well placed to observe 
a person in some particular setting, school, job, camp, etc., is usually 
limited. Often only one person has been in close contact with the 
ratee in a particular relationship. He has had only one homeroom 
teacher, only one foreman, only one tent counselor. Others have had 
some contact with him, but it may be so much less that their judg- 
ments add little to the judgment of the rater most intimately in- 
volved. 

Note that we specified the pooling of independent ratings. If the 
ratings are independently made, the "error" components will be in- 
dependent and will tend to cancel out. If, however, the judgments 
are combined through some sort of conference procedure, we cannot 
tell just what may happen. Errors may cancel out, wisdom may win,
or the prejudices of the most dogmatic may prevail. Pooling
independent judgments is the only sure way of balancing out individual
errors and has been found in several studies to be more
satisfactory than the conference type of procedure.


SUMMARY AND EVALUATION 


In spite of all their limitations, evaluations of persons through
ratings will undoubtedly continue to be widely used for
administrative evaluations in schools, civil service, and industry, as well as in
educational and psychological research. We must recognize this fact
and learn to live with it. Granting that we shall continue to use
ratings of different aspects of personality, we should do so with full
awareness of the limitations of our instrument, and we should do so
in such a way that these limitations are minimized.

The limitations of rating procedures arise out of:


1. A humane unwillingness to make unfavorable judgments of our 
fellows, which is particularly Pronounced when we identify to some 
extent with the person being rated. 

2. Wide individual differences among raters in "humaneness" or,
in any event, in leniency or severity of rating.

3. A tendency to respond to other persons as a whole in terms of
our general liking or aversion and difficulty in differentiating out 
specific aspects of the individual personality. 

4. Limited contact between the rater and person being rated— 
limited both in amount and in type of situation in which seen. 

5. Ambiguity in meaning of the attributes to be appraised. 

6. The covert and unobservable nature of many of the inner as- 
pects of personality dynamics. 

7. Instability and unreliability of human judgment. 


In view of these limitations it is suggested that ratings will pro- 
vide a most accurate portrayal of the person being rated when: 


1. Appraisal is limited to those qualities that appear overtly in
interpersonal relations.

2. The qualities to be appraised are analyzed into concrete and
relatively specific aspects of behavior, and judgments are made of
these behaviors.

3. A rating form is developed that forces the rater to discriminate
and/or that has controls for rater differences in judging standards.

4. Raters are used who have had the most opportunity to observe
the individual in situations in which he would display the qualities
to be rated.

5. Raters are "sold" on the value of the ratings and trained in the
use of the rating instrument.

6. Independent ratings of several raters are pooled when there are
several persons qualified to carry out ratings.


Peer-nominating techniques have interesting possibilities for use
in schools and other group settings. They permit sociometric
analyses of the interpersonal relations of pupils in a classroom or the
workers in a shop. "Guess Who" nominations permit a simple type
of rating in the early grades.

Evaluation procedures in which the significance of his ratings is
somewhat concealed from the rater present an interesting possibility
for civil service and industrial use. This is true particularly when
controls on rater bias are introduced through "forced-choice"
techniques or a correction score.


REFERENCES 


1. Brookover, W. B., Person-person interaction between teachers and pupils
and teaching effectiveness, J. educ. Res., 1940, 34, 272-287.


2. Cook, W., and C. H. Leeds, Measuring the teaching personality, Educ. 
psychol. Meas., 1947, 7, 399-410. 

3. Doll, E. A., Measurement of social competence, Minneapolis, Minn.,
Educational Test Bureau, Educational Publishers, Inc., 1953.

4. Harrington, W., Recommendation quality and placement success, Psychol.
Monogr., No. 252, 1943.

5. Highland, R. W., and J. R. Berkshire, A methodological study of forced 
choice performance rating, San Antonio, Texas, Human Resources Research 
Center, Lackland Air Force Base, May 1951 (Research Bulletin 51-9). 

6. Horace Mann Lincoln Institute of School Experimentation, How to con- 
struct a sociogram, New York, Teachers College, Columbia University, Bureau 
of Publications, 1950. 

7. Judy, C. J., A comparison of peer and supervisory rankings as criteria of
aircraft maintenance proficiency, Doctor of Education Project Report, Teachers 
College, Columbia University, 1952. 

8. Lins, L. J., The prediction of teaching efficiency, J. exp. Educ., 1946,
15, 2-60.

9. Moreno, J. L., Who shall survive?, Washington, D. C., Nervous and
Mental Disease Publishing Co., 1934.

10. Personnel Research Section, AGO, Analysis of an Officer Efficiency Re- 
port (WD AGO Form 67-1) using multiple raters, Washington, D. C., Adjutant 
General's Office, 1952 (PRS Report 817). 

11. Personnel Research Section, AGO, A study of officer rating methodology, 
validity and reliability of ratings by single raters and multiple raters, Washing- 
ton, D. C., Adjutant General's Office, 1952 (PRS Report 904). 

12. Personnel Research Section, AGO, Survey of the Aptitude for Service 
Rating system at the U. S. Military Academy, West Point, New York, Washing- 
ton, D. C., Adjutant General's Office, 1953. 

13. Preston, H. O., The development of a procedure for evaluating officers in
the United States Air Force, Pittsburgh, Pa., American Institute for Research.

14. Reed, H. J., An investigation of the relationship between teaching
effectiveness and the teacher's attitude of acceptance, J. exp. Educ., 1953, 21,
277-325.

15. Remmers, H. H., N. W. Shock, and E. L. Kelly, An empirical study of
the validity of the Spearman Brown formula as applied to the Purdue Rating
Scale, J. educ. Psychol., 1927, 18, 187-195.

16. Richardson, M. W., and G. F. Kuder, Making a rating scale that
measures, Person. J., 1933, 12, 36-40.

17. Symonds, P. M., Diagnosing personality and conduct, New York,
Century, 1931.


SUGGESTED ADDITIONAL READING 


Ferguson, Leonard W., Personality measurement, New York, McGraw-Hill,
1952, Chapters 10 and 11.
Jennings, Helen Hall, in association with the Staff of Intergroup Education
in Cooperating Schools, Hilda Taba, Director, Sociometry in group relations:
a work guide for teachers, Intergroup Education in Cooperating Schools, Work
in Progress Series, Washington, D. C., American Council on Education,
1948.
Jennings, Helen Hall, Leadership and isolation: a study of personality in
inter-personal relations, 2nd ed., New York, Longmans, Green, 1950.
Monroe, Walter S., Editor, Encyclopedia of educational research, rev. ed.,
New York, Macmillan, 1950, pp. 961-965.
Sisson, Donald E., Forced-choice—the new army rating, Person. Psychol.,
1948, 1, 365-381.
Tiffin, Joseph, Industrial psychology, New York, Prentice-Hall, 1947, Chapter
10.

QUESTIONS FOR DISCUSSION 


1. If you were writing to someone who had been given as a reference by
an applicant for a job in your company or for admission to your school, what
should you do in order to obtain the most useful evaluation of the applicant?

2. Make as complete a list as you can of the different ratings used in the
school that you are attending or the school in which you teach. What type
of a rating scale or form is used in each case?

3. In the light of such evidence or opinion as you can obtain, how effective
are the ratings that you identified in the previous question? How adequate a
spread of ratings is obtained? How consistently is the scale used by different
users? What is your impression of the reliability of the ratings? Of their
freedom from halo and other errors?

4. What factors influence a rater's willingness to rate conscientiously?
How serious is this issue? What can be done about it?

5. Why would three independent ratings from separate raters ordinarily be
preferable to a rating prepared by the three persons working together as a
committee?

6. In the personnel office of a large company, employment interviewers are
called upon to rate job applicants at the end of the interview. Which of the
following characteristics would you expect to be rated reasonably reliably?
Why?

a. Initiative.
b. Appearance.
c. Work background.
d. Dependability.
e. Emotional balance.


7. In a small survey of the report cards used in a number of communities 
the following four traits were most frequently mentioned as found on the re- 
port cards: (a) courteous, (b) cooperative, (c) health habits, (d) works with 
others. How might these be broken down or revised so that the classroom 
teacher could evaluate them better? 

8. Which of the following would influence your judgment of a person in
an interview? In what way?

a. A very firm grip in shaking hands.
b. Wearing a "loud" necktie.
c. Generally pausing for a moment before replying to a question.
d. Playing with keys on a key ring.
e. Having a spot on his vest.
f. Looking at the floor all during the interview.




9. Compare the reactions of several class members or of several acquaint- 
ances on the items of question 8. How general are the reactions? What basis 
in fact is there for them?

10. What advantages do ratings by peers have over ratings by superiors?
What disadvantages?

11. What are the advantages of ranking over rating on a rating scale? 
What are the disadvantages? 

12. Suppose that a forced-choice rating scale had been developed for use 
in rating the teachers in the city school systems in order to get an evaluation 
of their effectiveness. What advantages would this rating procedure have 
over other types of ratings? What problems would be likely to arise in using 
it?

13. Make up a "Guess Who" form that might be useful to a teacher in
finding out about the pupils in his class. If a class group is available to you,
try the form out and analyze the results. What precautions should be taken
in using the results?


14. Using a class group taught by some class member or made available by 
the instructor, get each child's choices for other children to work on a com- 
mittee with him. Plot the results in a sociogram. What do the results tell 
you about the class and the pupils in it? What limitations would this socio- 
gram have for judging the status of an individual child among his classmates?

15. Suppose you have been placed in charge of a merit rating plan which is
being introduced in some company. What steps would you take to try to
get as good ratings as possible?


Chapter 14 


Questionnaires and 
Inventories for Self-Appraisal 


If we wish to find out about some aspects of an individual's per- 
sonality, such as his likes and dislikes, his interests, or his feelings 
about and attitudes toward people and activities, an obvious ap- 
proach is to ask him about them. No one has so continuous an 
opportunity to observe John Jones as does John Jones himself. And 
no one is so well situated to look inside John Jones. At least, so it 
would seem. Presently we shall raise some questions as to how good 
a view John has of himself. But it is certainly true that he has some 
unique advantages, in comparison with any outsider. He is there 
all the time and has a continuity of exposure to John which nobody 
else has. And he is directly aware of John’s inner life of thought and 
feeling. We certainly cannot afford to dismiss the individual's self- 
observation as a source of insight into aspects of his personality. 

We may try to exploit the individual's knowledge and evaluation
of himself in an interview, asking him a series of questions focused
around the areas of our concern. The employment interviewer, the
counselor, the case worker, and the psychiatrist proceed in this way.
But the clinical interview is an unstandardized inquiry, highly
dependent upon the particular interviewer both for the way it is
carried out and for the way it is interpreted. Furthermore, individual
interviews place very heavy demands upon the time of interviewing
personnel, demands which may be prohibitive in a number of
situations. To economize on interviewer time, then, and to provide an
inquiry that is uniform in presentation and procedure for
evaluation, the printed questionnaire has been developed. The self-report
questionnaire or inventory is essentially this: a standard set of
questions about some aspect or aspects of the individual's life history,
feelings, preferences, or actions presented in a standard way and
scored with a standard scoring key.


THE BIOGRAPHICAL DATA BLANK 


An obvious and important use of the questionnaire is as a means
of eliciting factual information about the individual's past history. 
Place and date of birth, amount and type of education and degree 
of success with it, nature and duration of previous jobs, hobbies, 
special skills, and a host of other biographical facts can be deter- 
mined most economically through a blank filled out by the indi- 
vidual himself. It is the economy and efficiency of this approach 
that makes it particularly appealing. Though his reports may be
inaccurate in some respects, the individual himself is probably the
richest single repository for the factual information we would like
to have about him.

The problems in using questionnaires to elicit facts are primarily
problems of communication. When questions are preformulated and
appear in printed form and answers are written down,
misunderstanding may occur either in the respondent's interpretation of the
question or in the using agency's interpretation of his response. If
there is no personal interaction, these misunderstandings cannot be
cleared up with an oral question or a further probing into the area
of uncertainty. It is important, therefore, that a fact-finding
questionnaire be very carefully worded and that it be tried out in
preliminary form with small groups to make sure that the ambiguities
have been cleared out of it.

An interview to supplement the questionnaire is often desirable, in
order to permit clarification of any of the responses to questionnaire
items that are puzzling to the user or to get fuller information on
some points. As a matter of fact, one appropriate use of self-report
inventories of all types is to provide a jumping-off place for an
interview, the questionnaire providing leads that may be followed up in
the interview.


Sometimes the factual information on an application blank or
other fact-finding questionnaire has been used to determine whether
the individual meets certain stated requirements to be eligible for a
job, educational program, or the like. Sometimes it has been used
as part of the raw material from which the personnel officer, director
of admissions, or scholarship committee makes a clinical judgment
of the individual's desirability as an employee or student. In a few
instances, however, biographical data blanks have been analyzed
item by item to determine to what extent particular responses to
each item actually predict some criterion of job success. Items
found to discriminate more successful from less successful individuals
are given a score credit, and the separate items are summed to give
a score for the blank as a whole. Thus, the World War II programs
for selecting pilot trainees for both the Army and the Navy used a
scored biographical data blank that was treated just as if it were a
test. The life-insurance companies have for a number of years used
an Aptitude Index in selecting insurance salesmen, one section of
which consists of factual items about the individual applicant.
Thus, the individual is asked about the amount of insurance he
himself carries, his net worth, etc. A scoring system assigns scores for
each response in terms of the success experienced by those in the
validation group who had given that response.

In the examples given above, objective scoring of a biographical 
data blank provided one of the most valid predictors of job success. 
These results suggest that there may be a number of other selection 
situations in which a standard scoring procedure could be used with 
advantage. The development of scoring weights is a major under- 
taking, but once a scoring system has been developed the scoring of 
individual blanks proceeds rapidly. It has even been possible in 
military use of biographical inventories to prepare them in multiple- 
choice form and score them like any standard test. 
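
The logic of item-by-item scoring weights can be sketched briefly. The example below is a simplified illustration and not the procedure used in any of the programs mentioned: the items, the tiny validation group, the unit weights of +1, 0, and -1, and the 0.10 margin are all hypothetical choices made for the sketch.

```python
from collections import defaultdict


def build_key(validation, margin=0.10):
    """Derive a scoring key from a validation group.

    validation is a list of (responses, succeeded) pairs, where responses
    maps item name -> response and succeeded is True or False.
    A response earns +1 if those who gave it succeeded more often than
    the group as a whole by more than margin, -1 if less often, else 0.
    """
    base_rate = sum(ok for _, ok in validation) / len(validation)
    counts = defaultdict(lambda: [0, 0])  # (item, response) -> [n, n_success]
    for responses, ok in validation:
        for item, response in responses.items():
            counts[(item, response)][0] += 1
            counts[(item, response)][1] += int(ok)
    key = {}
    for pair, (n, n_ok) in counts.items():
        diff = n_ok / n - base_rate
        key[pair] = 1 if diff > margin else -1 if diff < -margin else 0
    return key


def score_blank(responses, key):
    """Sum the credits for one applicant's blank; unseen responses score 0."""
    return sum(key.get((item, resp), 0) for item, resp in responses.items())


# Hypothetical validation group of four salesmen.
validation = [
    ({"owns_insurance": "yes", "married": "yes"}, True),
    ({"owns_insurance": "yes", "married": "no"}, True),
    ({"owns_insurance": "no", "married": "yes"}, False),
    ({"owns_insurance": "no", "married": "no"}, False),
]
key = build_key(validation)
print(score_blank({"owns_insurance": "yes", "married": "no"}, key))
```

With a group this small the weights mean nothing; a real key would need a large validation group and a check on a fresh sample before use.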


INTEREST INVENTORIES 


One aspect of the individual's make-up that we would like to
study, both to understand him as a person and to help in such
immediately practical problems as educational and vocational
guidance, is the domain of interests and aversions, preferences for
activities and surroundings. Of course, in the matter of vocational
interests, the simplest procedure would seem to be to ask the
individual how much he would like to be an engineer, for example.
However, this doesn't work out very well in practice. In the first
place, people differ in the readiness with which they exhibit
enthusiasm. "Like very much" for person A may signify no more
enthusiasm than "like" for person B. In the second place, people differ
substantially in the nature and completeness of their understanding
of what a particular job means in terms of activities and conditions
of work. "Engineer" to one person may signify primarily out-of-doors
work; to another it may carry a flavor of the laboratory or
drafting board; to still another it may signify vaguely a high-prestige,
science-oriented job. These varied and incomplete meanings cause
a response to the single question, "How much would you like to be
an engineer?" to be a rather unsatisfactory indicator of the degree
to which the individual has interests really suitable for the
profession of engineering. It is for these reasons that psychometricians
have undertaken to broaden the base of information and to ask a 
whole array of questions about the individual's likes and dislikes, 
rather than simply to ask directly about preference for particular 
jobs. 


THE STRONG VOCATIONAL INTEREST BLANK 


One of the best known instruments for appraising interests is the 
Strong Vocational Interest Blank for Men. This inventory is made 
up of 400 items, broken up into the types listed here: liking for 
occupations, liking for amusements, liking for activities, reaction to 
peculiarities of people, choice or preference between activities, and 
evaluation of personal abilities and characteristics. A companion 
instrument exists for women, but we shall discuss the blank for men, 
because it has been more fully developed and most of the research 
work has been done upon it. 

To most of the 400 items in the Strong blank the individual responds
by marking one of the three given options L, I, and D (Like,
Indifferent, Dislike). A response is called for to each item. Over
40 different scoring keys have been developed for the men's blank.
Most of these are for specific occupations, largely at the professional
level, such as architect, chemist, lawyer, or YMCA secretary, though
there are also keys for masculinity of interests and occupational
level.

The scoring key for each occupation was based on the comparison
of a group of men who were successfully engaged in that occupation
with a reference group of men-in-general. Thus, the per cent of men
in occupation A choosing the L, I, and D options to item 1 is compared
with the per cent of men-in-general choosing these same options. If
enough more men in occupation A choose a particular option, that
option receives a plus score for occupation A. If the per cent is
smaller for occupation A, the option receives a minus score. If the
per cent for occupation A is very much larger or smaller than for
men-in-general, the score may be as much as +4 or -4. Smaller scores
are assigned to smaller differences. Thus, responses are weighted to
take account of the sharpness with which the item discriminates.
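
As a rough illustration of the weighting idea just described, the
following Python sketch converts the difference between two
percentages into a signed weight. The 10-point step per unit of
weight and the name option_weight are assumptions made for this
illustration only; the text does not give the conversion Strong
actually used, only that weights run from +4 to -4 and grow with the
size of the difference.

```python
def option_weight(pct_occupation, pct_general, step=10.0, max_weight=4):
    """Signed weight for one response option (illustrative rule only).

    pct_occupation: per cent of the occupational group choosing the option
    pct_general:    per cent of men-in-general choosing the same option
    step:           ASSUMED number of percentage points per unit of weight
    """
    diff = pct_occupation - pct_general
    # Larger differences earn larger weights, capped at max_weight.
    magnitude = min(max_weight, int(abs(diff) // step))
    return magnitude if diff >= 0 else -magnitude


# Hypothetical percentages for the L, I, and D options of one item.
occupation = {"L": 60.0, "I": 30.0, "D": 10.0}
general = {"L": 25.0, "I": 32.0, "D": 43.0}
weights = {opt: option_weight(occupation[opt], general[opt])
           for opt in ("L", "I", "D")}
print(weights)
```

Under this assumed rule a small difference, such as the two points
on option I, yields a zero weight, which matches the remark that some
options of an item may carry no weight at all.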


Table 14.1 shows the scoring key for the first ten items in the
blank for four different occupational keys. Note the range of weights
for the different items. Note that some or all of the options for a
given item may receive a zero weight.




Table 14.1. Scoring Weights for Sample Items and Keys of Strong
Vocational Interest Blank for Men


Scoring Key


Social-Science Production
Engineer Teacher Farmer Manager

Item L I D L I D L I D L I D

Actor (not movie) -1 0 1 1 9 de 8 oo 0 
Advertiser za d 2 0 1 = -2 1 1 -10 1 
Architect 2 p =i -1 0 1 00 0 Ж 50 
Army officer 1 0 -1 1 0 —1 оо 0 Ж 0 
Artist g 9 © ei 0 o9 -10 1 -1 0 0 
Astronomer 1 0 —1 -1 0 0 -1 0 1 0 0 0 
Athletic director —1 1 0 کت‎ =7 0 0 0 00 -i 
Auctioneer =i =i 2 0 I = 0X3 cm 0 0 1 
Author of novel -1 1 0 1 0 =i =й 1 = 0 0 
Author of technical book 3 ~1 ~2 0 1 0 -1 0 1 10 -1 


An individual's score is obtained by summing up the plus and
minus credits corresponding to the responses he has chosen. Since
the weights are different for each occupation, a separate scoring key
is required, and a separate score is obtained for the examinee for
each scale. Thus, a series of scores is obtained showing how closely
the responses given by our examinee correspond to those typically
given by each specific occupational group. Raw scores are translated
into a standard score scale in which 50 represents the mean for
men in the specific occupation. A scale of letter grades is also
provided, in which A represents close resemblance to the particular
occupational group, B+, B, and B- lesser degrees of resemblance, and
C+ or C interest patterns quite different from those of the particular
occupational group.
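
The summing of credits across separate keys can be sketched in a few
lines of Python. The weights below are illustrative stand-ins and are
not a reproduction of the published keys; the names ILLUSTRATIVE_KEYS
and raw_score are hypothetical. The conversion of raw scores to
standard scores and letter grades is left out, since the text does
not give the conversion tables.

```python
# ASSUMED weights for three items on two occupational keys.
ILLUSTRATIVE_KEYS = {
    "engineer": {
        "Actor": {"L": -1, "I": 0, "D": 1},
        "Astronomer": {"L": 1, "I": 0, "D": -1},
        "Author of technical book": {"L": 3, "I": -1, "D": -2},
    },
    "salesman": {
        "Actor": {"L": 1, "I": 0, "D": -1},
        "Astronomer": {"L": -1, "I": 0, "D": 1},
        "Author of technical book": {"L": -2, "I": 0, "D": 2},
    },
}


def raw_score(responses, key):
    """Sum the plus and minus credits for the options actually marked."""
    return sum(key[item][choice] for item, choice in responses.items())


# One examinee's L/I/D responses, scored once per occupational key.
responses = {"Actor": "D", "Astronomer": "L", "Author of technical book": "L"}
scores = {name: raw_score(responses, key)
          for name, key in ILLUSTRATIVE_KEYS.items()}
print(scores)
```

The same set of responses is passed through every key, which is why
hand-scoring forty or more scales was so laborious.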

Table 14.2 shows the standard scores and letter ratings on the
occupational scales of the blank for a specific college freshman.
Thus, this young man shows interest patterns resembling closely (A)
those of chemists, farmers, and mathematics and science teachers.
His interests are also quite like (B+) those of physicians, dentists,
engineers, and carpenters. His interests are very unlike (C-) those
of YMCA secretaries, city school superintendents, ministers, and
life-insurance salesmen.


Table 14.2. Scores on Strong Vocational Interest Blank for a
College Freshman


                                       Standard   Letter
      Occupation                       Score      Rating
I.    Artist                           26         C+
      Psychologist                     22         C
      Architect                        29         C+
      Physician                        42         B+
      Dentist                          41         B+
II.   Mathematician                    26         C+
      Engineer                         44         B+
      Chemist                          52         A
III.  Production manager               39         B
IV.   Farmer                           59         A
      Carpenter                        44         B+
      Math and science teacher         48         A
V.    YMCA physical director           34         B-
      Personnel manager                21         C
      YMCA secretary                   Low *      C-
      Social-science teacher           17         C
      City school superintendent       Low *      C-
      Minister                         Low *      C-
VI.   Musician                         25         C+
VII.  CPA                              16         C
VIII. Accountant                       25         C+
      Office worker                    25         C+
      Purchasing agent                 28         C+
      Banker                           22         C
IX.   Sales manager                    19         C
      Real estate salesman             17         C
      Life-insurance salesman          Low *      C-
X.    Advertising man                  19         C
      Lawyer                           20         C
      Author-journalist                24         C


* "Low" designates a standard score of 15 or lower.


Originally, scoring the Strong Vocational Interest Blank was a very
time-consuming task because of the large number of different scores
that are called for. Hand-scoring a blank was a matter of several
hours' work. But twentieth century electronics has hit the test-scoring
field, and a special device developed by E. J. Hankes has made it
possible to score the blanks at very high speed. This scoring
machine is available only at Engineers Northwest, Minneapolis,
Minnesota. The special answer sheets must be sent to this organization,
where they will be scored at a cost that is a fraction of what
the cost would be by hand methods.*

There are two points about the construction of the Strong blank 
to which we wish to call especial attention at this time. In the first 
place, the person taking the test responds by choosing one of a set 
of response categories for each item (L, I, D). A particularly effu- 
sive individual could choose all L's, and a particularly jaundiced one 
could choose all D's. There is a certain amount of freedom to im- 
pose one's own standards upon the task. Secondly, the keys are ex- 
ternally determined. That is, they are defined by the responses of 
a particular job group and not by any internal logic. We wish now 
to contrast with the Strong Blank the Kuder Preference Record, 
which is different with respect to both of these features. 


THE KUDER PREFERENCE RECORD (VOCATIONAL) 


The Kuder Preference Record is made up of triads, or sets of three
options. Typical sets might read:


Go for long hikes in the woods.
Go to a symphony concert.
Go to an exhibit of new inventions.

Fix a broken clock.
Keep a set of accounts.
Paint a picture.

In each set the individual is required to mark the one he would like
to do most and the one he would like to do least.

Scoring keys were established on the basis of the internal relation- 
ships of the items. Thus, a study of the responses to the items showed 
that a number of items dealing with mechanical activities tended to 
hang together. If a person chose one he was likely to choose others, 
and if he rejected one he was likely to reject the others. Moreover, 
items in this group showed relatively little relationship to the re- 
maining items. The items grouped together in a distinct cluster. 
From the nature of the items it was evident that this cluster related 
to mechanical interest. Those items having a substantial correlation 
with this cluster were included in a scoring key that gave a score for 
mechanical interest. 
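
The "hanging together" of items can be illustrated with a small
correlation computation in Python. The sketch uses invented 0/1
choice data for eight hypothetical respondents and an assumed cutoff
of 0.50 for "substantial" correlation; it is not Kuder's actual
procedure, which the text does not describe in detail, and the item
names and the helper cluster_around are made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# 1 = respondent chose the activity, 0 = did not (invented data).
CHOICES = {
    "fix_clock":     [1, 1, 1, 1, 0, 0, 0, 0],
    "build_shelf":   [1, 1, 1, 0, 0, 0, 0, 0],
    "paint_picture": [1, 1, 0, 0, 1, 1, 0, 0],
}


def cluster_around(seed, choices, cutoff=0.50):
    """Items whose correlation with the seed item reaches the cutoff."""
    return [name for name, marks in choices.items()
            if name != seed and pearson(choices[seed], marks) >= cutoff]


print(cluster_around("fix_clock", CHOICES))
```

With these invented data the two mechanical items correlate highly
with each other, while the picture-painting item is unrelated to the
clock item and so falls outside the cluster.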

In the same way, other clusters were identified in which the items
went together but were largely independent of items not in the
cluster. Scoring keys were developed for these. The Preference
Record now yields scores for the following interest clusters: outdoor,
mechanical, computational, scientific, persuasive, artistic, literary,
musical, social service, and clerical. Raw scores are converted into
percentiles, separate norms being supplied for male and female
high-school students and for male and female adults.

* A price of 70¢ per answer sheet was quoted in 1955 for scoring blanks
in quantity lots.

In Table 14.3, the Kuder scores are given for the same college
freshman whose Strong scores were shown in Table 14.2. On the
Kuder, this young man stands highest on outdoor, scientific, and
mechanical interest. He is very low on clerical and persuasive
interests. These findings can be studied in relation to his interest in
specific occupations, as shown in Table 14.2. The two sets of results
are obviously consistent and support one another.


Table 14.3. Kuder Preference Record Scores of a College Freshman *


                                 Percentile
Interest Area      Raw Score     Equivalent
Outdoor            71            95
Mechanical         58            87
Computational      17            16
Scientific         60            93
Persuasive         25            07
Artistic           30            68
Literary           23            78
Musical            12            45
Social Service     36            46
Clerical           19            01


* Scores for same individual shown in Table 14.2.


COMPARISON OF STRONG AND KUDER INVENTORIES 

Note that in the Kuder Preference Record, the examinee is forced
to pick a most liked and a least liked activity in each set. No matter
how much or how little he likes all three, one must be preferred
and one rejected. This forced-choice pattern appears in a number
of inventories and should be contrasted with the category-response
pattern found in the Strong. The forced-choice pattern here is similar
to the one we discussed in connection with rating procedures
and has somewhat the same features. It forces a common frame of
reference upon everyone. Differences in general optimism are
controlled. Everyone must express the same number of preferences and
rejections. Thus, superficial differences in standards of judgment, or
what has been called "response set," are eliminated. But so also
are genuine differences in interest level. Whether the forced-choice
pattern produces a net gain in this respect is still a matter of debate.
The forced choices also probably facilitate somewhat concealing the
purpose of the instrument and getting the examinee to reveal himself
without his intending to do so. We will see this more clearly
in an adjustment inventory.

Note again that in the Preference Record the several scores relate 
to coherent interest clusters rather than to something outside the 
individual or the test. The scores carry their own relatively direct 
meaning in terms of the common theme running through the cluster 
of items. The meaning does not have to be inferred by thinking 
what lawyers or salesmen are like. If our purpose is to build up a 
meaningful description of an individual, the internally consistent 
scales appear more satisfactory than those that are externally 
oriented. To say that a person is high on mechanical, scientific, 
and out-of-doors interests and low on clerical and persuasive is more 
directly interpretable than to say he is high on interests character- 
istic of farmers, chemists, and mathematics-science teachers and low 
on those characterizing ministers and YMCA secretaries. Internally 
coherent clusters definable in terms of their common theme "make 
sense" better than job-oriented appraisals. 

When it comes to rating the individual for a specific job, however, 
the situation is radically changed. If our concern is to help the in- 
dividual decide whether he would be content in the job of engineer, 
it is much more directly relevant to know how well his interests 
correspond to those of successful engineers than to know how high his 
mechanical and scientific interests are. In the first case, the scoring 
key itself defines what the interests of engineers are; in the second 
case we must either infer this or determine it from a separate study. 

Either the internally consistent or job-oriented approach to
inventorying interests is possible, but which will work better probably
depends on our particular purpose. If our purpose is to appraise
appropriateness of interests for a limited number of specific jobs,
this may be done effectively with a specific job key for each job.
If, however, our concern is to get a meaningful description of a
person and perhaps to be prepared to use that description to make
inferences as to his suitability in any one of a very large number of
jobs, then the homogeneous cluster scores seem preferable.


RELIABILITY, VALIDITY, AND PERMANENCE OF INVENTORIED INTERESTS


The Strong Vocational Interest Blank is one of the most thoroughly
investigated psychometric tools we have, and, though the
history of the Kuder Preference Record is shorter, it too has been
intensively studied. Both instruments yield scores that are reasonably
reliable for individuals in their teens or over. Thus, for 285
Stanford University seniors Strong reports odd-even reliabilities
for the separate occupational scales ranging from 0.73 to 0.94, with
an average value of 0.88. A number of reliability studies with the
Kuder, based on analysis of a single testing, give values averaging
about 0.90. The reliability of the scores extracted from the interest
inventories compares favorably with scores on ability tests.
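
An odd-even reliability of the kind mentioned above can be sketched
as follows. The item data are invented, and the Spearman-Brown
step-up applied at the end is the customary correction for a
split-half coefficient; the text does not say whether the reported
figures were corrected in this way, so that step should be read as
an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of item credits per examinee."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Invented credits for five examinees on a four-item scale.
ITEM_SCORES = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(round(odd_even_reliability(ITEM_SCORES), 2))
```

A real scale would have far more items and examinees; the point of
the sketch is only the split into odd and even halves and the
correlation between the two half scores.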

For the Strong there is evidence that interests show a good
deal of stability over time, at least in adolescents and adults. Data
on the average correlation at different ages and over different periods
may be summarized as follows:


Upper
Elementary High College College
School School Freshmen Seniors
1 or 2 years 0.55 0.65 0.80 A
3 to 5 years 0.30 0.75 0.75
6 to 10 years 0.55 0.70


The stability is low in the elementary school, but for persons of col- 
lege age stability compares favorably with that for intelligence tests. 

In appraising the validity of an interest inventory as a description 
of how the individual feels about activities and events in the world 
about him, the main issue is the truthfulness of his responses. There 
isn't really any higher court of appeal for determining a person's 
likes and preferences than the individual’s own statement. 

A number of studies have indicated that inventories such as the
Strong can be faked. If a group of examinees is told to try to respond
the way that life-insurance salesmen would, they are generally
rather successful in making themselves appear like life-insurance
salesmen. However, this is no indication that the blank will be
faked, even when used as an employment device.

When the inventory is used for counseling and to help the re- 
spondent, as is most often the case, there is probably little reason to 
anticipate intentional faking. The individual may be expected to 
report his likes and dislikes as he knows them. His self-knowledge 
is perhaps imperfect, so his reports may be inaccurate in some re- 
spects. Thus, he may say that he would like to attend symphony 
concerts because he feels that that is the thing to say, but his actions 
may belie his statement; he may in fact avoid concerts whenever 
they come his way. This lack of self-insight is a real problem.
But it is probably mitigated somewhat in the inventory approach to
interests, where isolated points of poor insight will have only minor
effects upon a final score.

The validity of interest inventories as predictors of later behavior 
is another matter. Scoring keys for the Strong were established by
comparing men who were already in the occupation with men-in- 
general. The Kuder occupational interest profiles were also pre- 
pared by determining the average level on each of the interest areas 
for individuals already working in the occupation. But the common 
interest patterns of individuals in a field of work may have grown 
out of their work. The men may have come to exhibit certain com- 
mon patterns from the very nature of their work experience. The 
crucial evidence on predictive validity would come from testing a 
group before they entered the world of work and determining whether 
those who later entered and continued in a particular occupation had 
distinctive interest patterns before they entered the occupation. This 
is an expensive operation, expensive in the time that must elapse 
before men can become settled in their occupation and expensive in 
the dissipation of cases among literally hundreds of occupations. 

Strong has been able to follow some groups who were tested as
college undergraduates and does have some evidence on the extent
to which students with interests characteristic of a particular
occupation tended to enter that occupation and to persist in it. For the
typical individual, the occupation in which he was actually working
10 years later ranked second or third for him among all the scales
of the Strong. Considering group averages, those who remained in
an occupation received higher interest scores for that occupation
than for any other occupation and higher than those who switched 
to some other occupation. Thus, what evidence there is stands in 
support of the validity of the interest scores, though the evidence is 
admittedly limited. 


INTEREST AND ABILITY 

It is important not to confuse measures of interest and ability. 
The fact that a boy scores high on the scientific interest scale of the 
Kuder or on the physicist scale of the Strong is no guarantee that he 
possesses the intellectual and other aptitudes required to master the 
concepts of physics and become a physicist. Interest measures tell 
us nothing directly about abilities, though, as we shall see in a mo- 
ment, there are certain relationships between abilities and interests. 
Interest measures and ability measures deal with two quite distinct 
aspects of fitness for a field of study or work. Each provides
information that supplements the other. Interest is not a substitute
for ability, and, conversely, ability to learn the skills of a job is no
guarantee of success or satisfaction in the job.

There have been many studies of the relationship between interest 
and ability.* Most of these have related to aspects of academic 
work. In general, the relationship between achievement in a field 
such as science and the corresponding interest (i.e., scientific interest 
on the Kuder) is positive but low. Correlation of achievement with 
interest in the corresponding area will run about 0.30 to 0.50. Thus,
there is some slight tendency for those with high ability for a field 
of knowledge to show high interest in it. But the relationship is 
much too low for either type of measure to serve in place of the 
other. Both types of information are needed for any sound evalua- 
tion of an individual's suitability for a particular program of study 
or plan for work. 

Standardized interest inventories have been developed primarily 
for their contribution to vocational counseling and job placement. 
With this purpose in mind, they are directed at groups of high-school 
age or older. The Kuder, with its relatively general interest areas,
has been used satisfactorily at about the ninth grade and above.
The Strong, focusing on specific occupations and with a particular
emphasis upon occupations at the professional level, is suitable
primarily for senior high school pupils with definite plans to go to
college and with college groups. As in almost all inventories, these
instruments involve a good deal of reading. Their use with individuals
who fall below eighth or ninth grade reading level would probably
present serious problems. Several other inventories are listed
in Appendix III.

* See Frandsen for a review of some of this material.


TEMPERAMENT AND ADJUSTMENT INVENTORIES


Self-report inventories have been extensively developed in the
areas of temperament and personal adjustment. In these areas we
again encounter instruments developed to yield scores for internally
consistent clusters of behaviors, as did the Kuder Preference Record,
and instruments built with keys based on reference to some external
criterion, as was the Strong Vocational Interest Blank.

The basic material of all temperament and adjustment questionnaires
is much the same. They consist of an extensive catalogue of
statements about actions and feelings. To these the individual
responds by indicating whether each is or is not characteristic of him.
In many cases, a "?" or "uncertain" category is provided for the
person who does not wish to endorse an unequivocal "Yes" or "No"
answer. In the case of adjustment questionnaires, questions are
culled from case studies, writings on various types of adjustment
problems, suggestions of psychiatrists, and similar sources. For the
normal dimensions of temperament, a review of psychological and
literary treatments of personality differences and a systematic
scrutiny of previous questionnaires, together with the personal insights
of the investigator, provide the raw material for assembling items.

There are a large number of temperament and adjustment inven- 
tories. We will describe three in some detail, illustrating distinc- 
tively different patterns. These are the Guilford-Zimmerman Tem- 
perament Survey, the Minnesota Multiphasic Personality Inventory 
(MMPI), and the Shipley Personal Inventory. Then we will under- 
take a more general evaluation of the validity of inventories in this 
area and of the conditions under which we may expect them to be 
of value. 


THE GUILFORD-ZIMMERMAN TEMPERAMENT SURVEY 

The Guilford-Zimmerman Temperament Survey is a relatively re- 
cent member of a long series of schedules developed to identify dif- 
ferent dimensions of personality from the interrelationships of the 
test items themselves. That is, it tries to identify internally
consistent or homogeneous segments of behavior that are rather sharply
distinguished from other aspects. If "sociability" appears to be 
such a segment of behavior, the attempt is to get a cluster of items 
relating to the “sociability” dimension. If these hang together, so 
that a respondent who subscribes to one as describing him is likely 
also to subscribe to the others, and if they are responded to differ- 
ently from those dealing with "dominance," for example, we may 
feel that we have identified and appraised a distinct and potentially 
significant aspect of the individual. The detailed statistical pro- 
cedures of factor analysis, by which the separate clusters are identi- 
fied and refined, have become quite complicated and go far beyond 
the scope of this book. But the general purpose is fairly simple: to 
locate clusters of behaviors that hang together and are distinct from 
the other clusters studied and to develop a measure of each cluster, 
or "trait," if you will. This is the same approach that we saw in 
the Kuder Preference Record. 

The Guilford-Zimmerman inventory provides scores appraising the
clusters named and characterized below. Each cluster is characterized
both by descriptive phrases and by two illustrative items.


General Activity. A high score indicates rapid pace of activities; 
energy, vitality; keeping in motion; production, efficiency, liking for 
speed; hurrying; quickness of action; enthusiasm, liveliness. 


Sample Items 


You start to work on a new project with a great deal of enthusiasm.
(+)

You are the kind of person who is "on the go" all of the time. (+)


Restraint. A high score indicates serious-mindedness; deliberateness;
persistent effort; self-control; not being happy-go-lucky or
carefree; not seeking excitement.


Sample Items


You like to play practical jokes upon others. (-)
You sometimes find yourself "crossing bridges before you come to
them." (+)

Ascendance. A high score indicates habits of leadership; a tendency
to take the initiative in speaking with others; liking for speaking in
public; liking for persuading others; liking for being conspicuous;
tendency to bluff; tendency to be self-defensive.


Sample Items


You can think of a good excuse when you need one. (+)
You avoid arguing over a price with a clerk or salesman. (-)


Sociability. A high score indicates one who has many friends and
acquaintances; who seeks social contacts; who likes social activities;
who likes the limelight; who enters into conversations; who is not shy.


Sample Items


You would dislike very much to work alone in some isolated place. (+)
Shyness keeps you from being as popular as you should be. (-)


Emotional Stability. A person with a high score shows evenness
of moods, interests, etc.; optimism, cheerfulness; composure; feelings
of being in good health; freedom from feelings of guilt, worry, or
loneliness; freedom from day dreaming; freedom from perseveration
of ideas and moods.


Sample Items


You sometimes feel "just miserable" for no good reason at all. (-)
You seldom give your past mistakes a second thought. (+)


Objectivity. The high scorer is defined as free from the following:
egoism, self-centeredness; suspiciousness, fancying hostility; ideas of
reference; a tendency to get into trouble; a tendency to be
thin-skinned.


Sample Items


You nearly always receive all the credit that is coming to you for
things you do. (+)
There are times when it seems everyone is against you. (-)


Friendliness. High scores signify respect for others; acceptance of
domination; toleration of hostile action; freedom from hostility,
resentment, or desire to dominate.


Sample Items


When you resent the actions of anyone, you promptly tell him so. (-)
You would like to tell certain people a thing or two. (-)


Thoughtfulness. The high-scoring person is characterized as
reflective, meditative; observing of his own behavior and that of others;
interested in thinking; philosophically inclined; mentally poised.


Sample Items


You are frequently "lost in thought." (+)
You find it very interesting to watch people to see what they will do.
(+)


Personal Relations. High scores signify tolerance of people; faith
in social institutions; freedom from self-pity or suspicion of others.


Sample Items


There are far too many useless laws that hamper an individual's
personal freedom. (-)
Nearly all people try to do the right thing when given a chance. (+)


Masculinity. The high-scoring person is interested in masculine
activities; not easily disgusted; hardboiled; inhibited in emotional
expression; resistant to fear; unconcerned about vermin; little
interested in clothes, style, or romance.


Sample Items


You can look at snakes without shuddering. (+)
The sight of ragged or soiled fingernails is repulsive to you. (-)


Since each of these clusters can be thought of as a dimension having
two ends, just as we have north and south, east and west, there
is an opposite end of the dimension that can be characterized as
just the reverse of the descriptions given above. Items marked (-)
characterize this opposite end. Of course, most people do not score
at either extreme on these dimensions. Here, as elsewhere, a continuous
range of variation with most people occupying an intermediate
position is the characteristic pattern. Most people are neither
outstandingly active nor conspicuously lethargic, neither clearly
ascendant nor clearly submissive. People can rarely be well described
by clear-cut personality types. They are described as showing
different traits in varying degrees.

Choosing the names for the clusters presented above was a bit of 
a problem, because the clusters do not correspond exactly to the 
language labels we bring with us. Each cluster is defined by the 
items that went into it and that were grouped together because 
they actually went together in the responses of people taking the 
inventory. The titles are approximate. Each cluster can be under- 
stood more exactly only by a close study of the items that have gone 
into it. 

Table 14.4 * shows the reliabilities of the separate scores, and the
intercorrelations of the scores. The reliabilities cluster about 0.80
and are adequate, though not strikingly high. The attempt, in
developing this inventory, was to identify a number of relatively
independent aspects of personality. This means that the correlations of
the different scores should be low. They tend to be. However, certain
of the scores show rather substantial correlations. Attention
may be directed to Ascendance and Sociability, Emotional Stability
and Objectivity, Friendliness and Personal Relations, and Restraint
and Thoughtfulness. These pairs of scores are far from independent,
and the information provided by the scores is overlapping. In a
sense, the inventory is only partially efficient because of the
duplication in the different scores. It is as if we were in part saying the
same thing over again. In most cases, however, each score provides
information about a new and distinctive aspect of the individual.


* People who read about tests and testing will frequently have
occasion to study tables of correlations like Table 14.4. In the table
the column at the left lists the different variables and numbers them
in order. The numbers (but not the names) are repeated across the top
of the table. Look at the row labeled "1 General activity." The
numbers that appear in this row are the correlations of "general
activity" with each of the other variables. The first figure, -.16, is
the correlation between "general activity" and variable 2, "restraint."
This means that there is a slight tendency for high scores on the
general activity scale to go with low scores on the restraint scale.
The next figure, .34, is the correlation of "general activity" with
"ascendance," and the other entries are to be read in the same way.
The correlation between any two variables will be found in the row and
column whose numbers correspond to those variables. In this table, the
reliability coefficients for the variables are shown in a column at
the extreme right.


Table 14.4. Intercorrelations and Reliabilities of the Ten Scales of the
Guilford-Zimmerman Temperament Survey

                                     Intercorrelations
    Scale                     2     3     4     5     6     7     8

 1  General activity        -.16   .34   .35   .34   .14  -.17   .24
 2  Restraint                     -.08  -.21   .08   .05    …     …
 3  Ascendance                           .61    …     …     …     …
 4  Sociability                                .23   .36  -.06   .04
 5  Emotional stability                               …    .37  -.13
 6  Objectivity                                            .34  -.04
 7  Friendliness                                                -.03
 8  Thoughtfulness
 9  Personal relations
10  Masculinity

* Kuder-Richardson formula, based on 912 college students.


The Guilford-Zimmerman Inventory has several characteristics that 
it may be well to summarize at this time. 


1. It is based upon the responses of normal everyday people, not 
of the overtly maladjusted or the institutionalized. 

2. Its scales are set up by internal analysis, by study of the 
“going together" of groups of items. 

3. Responses are taken at face value. Their significance is assumed 
to be given by their obvious content. 

4. The respondent may endorse as many or as few of the items as 
he wishes; his choices are not forced or constrained. 


By contrast, let us consider the Minnesota Multiphasic Personality 
Inventory, which differs radically with respect to the first three
features.


THE MINNESOTA MULTIPHASIC PERSONALITY INVENTORY 

The Minnesota Multiphasic Personality Inventory was developed 
to identify a number of distinct categories of abnormal behavior. 
A pool of items was gathered which referred to different types of 
psychopathology: hysteria, depression, hypochondriasis, paranoid 
tendencies. The pool of items was tried out on a group of "nor- 
mals" * and upon a number of different groups with specific patterns 
of symptoms of maladjustment. The procedure was essentially the 
same as that for the Strong. Items were scored when they distinguished
a given pathological group from the group of “normal”
control cases.

* The problem of selecting a group of normal and well-adjusted persons is often
a harder one than selecting people with a particular type of pathology. A particular
type of disease can be identified with a good deal of definiteness, but absence
of disease is a fuzzier notion, harder to define and to identify.
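
The keying procedure may be easier to follow in a small sketch. The answer sheets below are invented, and the rule used here (key an item when the proportions endorsing it in the two groups differ by at least `min_diff`) is an assumption made for illustration, not the criterion actually used in building the MMPI.

```python
def endorsement_rate(group, item):
    """Proportion of a group answering True to one item."""
    return sum(person[item] for person in group) / len(group)


def build_key(criterion_group, normal_group, n_items, min_diff=0.25):
    """Key every item that separates the two groups by at least min_diff.

    Returns {item index: keyed answer}; the keyed answer is the response
    given more often by the criterion (pathological) group.
    """
    key = {}
    for item in range(n_items):
        diff = (endorsement_rate(criterion_group, item)
                - endorsement_rate(normal_group, item))
        if abs(diff) >= min_diff:
            key[item] = diff > 0
    return key


def scale_score(answers, key):
    """Count the answers that agree with the keyed direction."""
    return sum(1 for item, keyed in key.items() if answers[item] == keyed)


# Invented data: each row is one person's True/False answers to three items.
patients = [[True, False, True], [True, False, False],
            [True, True, True], [True, False, False]]
normals = [[False, True, True], [False, True, False],
           [True, True, True], [False, True, False]]

key = build_key(patients, normals, n_items=3)
print(key)                                     # items 0 and 1 are keyed
print(scale_score([True, False, False], key))  # both keyed answers given
```

The third item is endorsed equally often by both groups and so is left out of the key, whatever its apparent content may be.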

The different scales of the MMPI are described below and illustrated
with sample items. It must be remembered that the scales 
were established by using groups of patients showing behavior that 
was judged to be definitely abnormal. We cannot automatically 
apply the same labels to the variation in these traits that appears 
among normal individuals. The interpretation of scores found for 
normal persons must be made with caution. 

Hypochondriasis Scale (Hs). This scale assesses the amount of 
abnormal or excessive concern with bodily functions. A high score 
indicates undue worry about health, often accompanied by reports 
of obscure pains and disorders that are difficult to identify. 


Sample Items 
I do not tire quickly. (—) 
The top of my head sometimes feels tender. (+) 


Depression Scale (D). This scale appraises a tendency to be chron- 
ically depressed, to feel useless and unable to face the future. 


Sample Items 


I am easily awakened by noise. (+) 
Everything is turning out just like the prophets of the Bible said it
would. (+) 


Hysteria Scale (Hy). This scale gets at the tendency to solve personal
problems by developing physical symptoms, such symptoms as
paralyses, cramps, gastric or intestinal complaints, or cardiac symptoms.
The symptoms tend to appear under emotional stress and to
be used as an escape mechanism.


Sample Items
I am likely not to speak to people until they speak to me. (+)
I get mad easily and then get over it soon. (+)


Psychopathic Deviate Scale (Pd). This scale was based upon a
group who showed absence of deep emotional response, inability to
profit from experience, and disregard for social pressures and the regard
of others. They are individuals who, from their disregard of
social pressures, are likely to get into trouble of various sorts.


Sample Items
My family does not like the work I have chosen. (+)
What others think of me does not bother me. (+)




Paranoia Scale (Pa). The qualities evaluated by this scale are 
suspiciousness, oversensitivity, and feelings of being picked on or 
persecuted. 

Sample Items


I am sure I am being talked about. (+) 
Someone has control over my mind. (+) 


Psychasthenic Scale (Pt). This scale was based on patients who 
were troubled with excessive fears or with compulsive tendencies to 
dwell on certain ideas or perform certain acts. High score indicates 
resemblance to this group. 


Sample Items 


I easily become impatient with people. (+) 
I wish I could be as happy as others seem to be. (+) 


Schizophrenic Scale (Sc). This scale is based upon a group of patients
characterized by bizarre and unusual thought or behavior,
and a subjective life tending to be divorced from the world of reality.
High scores indicate responses similar to this group.


Sample Items 


I have never been in love with anyone. (+) 
I loved my mother. (-) 


Hypomania Scale (Ma). This scale evaluates a tendency to be
overactive both bodily and mentally, with a tendency to skip around
rapidly from one thing to another.


Sample Items


I don't blame anyone for trying to grab everything he can get in this
world. (+)
When I get bored I like to stir up some excitement. (+)


Masculinity-Femininity Scale (Mf). This scale measures interests 
characteristic of the one or the other sex. 


Sample Items 


I like movie love scenes. (F) 
I used to keep a diary. (F) 


The MMPI has a number of additional features, and these focus
attention on certain problems that arise in using adjustment questionnaires.
The first of these features is a lie scale (L). This is
based upon 15 items, imbedded in the questionnaire, that relate to
socially approved and virtuous activities (in the manner of the lying
test of May and Hartshorne described on p. 302). General population
norms indicate what may reasonably be expected on a set of
items of this sort. If a person marks an excessive number of these
socially approved behaviors, it is considered to be an indication
that he tends, consciously or unconsciously, to distort his report so
that he appears in a favorable light. That is, he tends to "fake
good."

Another score, the K scale, was built up by keying items that dis- 
tinguished known abnormals who had presented normal score pro- 
files from a control group of normals. A high score on this scale is 
thought to indicate a tendency to be very defensive in self-evaluation, 
whereas a low score brings out the tendency to be extremely self- 
critical, i.e., to “fake bad." 

The ? score is based upon the number of ? or undecided responses. 
A very high number is thought to indicate a tendency to evade the 
task imposed by the inventory: to withdraw from it and fail to face 
up to it. 

One further control scale is the F scale, made up of an assortment 
of unrelated items, each of which is rarely marked in the general 
population. A high score on this scale is thought to be symptomatic 
of careless and superficial marking of the inventory: of marking 
items at random or misunderstanding the questions. 

Thus, the authors of the MMPI have introduced a whole series
of control scales, designed to isolate individuals whose responses are
untrustworthy for one of several different reasons. They recognize,
first, that good adjustment (and also bad adjustment) can be faked
with at least partial success and that before an attempt is made to
interpret scores on an inventory some guarantee is needed that there
was not intentional faking. They recognize also that quite unintentionally
individuals differ in the severity of the standards by which
they judge themselves and that some control is needed on this difference
in severity of standards. They recognize unwillingness to
cooperate and inability to comprehend the task or to read the written
items, which may show up as superficial and meaningless patterns
of responses. All of these issues represent real problems to
users of an inventory, and the control scores represent one well-
conceived attempt to identify untrustworthy answer sheets.
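
How the four control scores might be applied in screening answer sheets can be sketched as follows. The cutoff values and the names `ControlScores`, `CUTOFFS`, and `flags` are hypothetical, chosen only to make the logic concrete; they are not published norms, and in practice the cutoffs would be taken from general population data.

```python
from dataclasses import dataclass


@dataclass
class ControlScores:
    L: int           # lie scale: socially approved items marked
    K: int           # defensiveness scale
    F: int           # rarely marked items that were marked
    cannot_say: int  # number of "?" (undecided) responses


# Hypothetical cutoffs for illustration only; not published norms.
CUTOFFS = {"L": 10, "K_high": 25, "K_low": 5, "F": 16, "cannot_say": 30}


def flags(scores, cut=CUTOFFS):
    """List the reasons, if any, for distrusting an answer sheet."""
    reasons = []
    if scores.L >= cut["L"]:
        reasons.append("possible 'fake good' (high L)")
    if scores.K >= cut["K_high"]:
        reasons.append("defensive self-evaluation (high K)")
    if scores.K <= cut["K_low"]:
        reasons.append("extreme self-criticism, 'fake bad' (low K)")
    if scores.cannot_say >= cut["cannot_say"]:
        reasons.append("evasion of the task (many ? responses)")
    if scores.F >= cut["F"]:
        reasons.append("careless or random marking (high F)")
    return reasons


print(flags(ControlScores(L=12, K=15, F=3, cannot_say=2)))
print(flags(ControlScores(L=2, K=15, F=3, cannot_say=2)))
```

An empty list means only that none of the control scores gives cause for suspicion; it is not positive evidence that the profile is accurate.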


In contrast with the Guilford-Zimmerman, we note that the 
MMPI: 




1. Is based upon the distinctive responses of selected groups of 
persons—in this case, groups each presenting a particular psycho- 
pathology. 

2. Has scales that are defined by these abnormal groups. 

3. Is not concerned with the apparent meaning of an item, but 
only with whether it functions—whether it serves to differentiate 
between the abnormal and the control group. 


It thus follows the general pattern of the Strong Vocational Interest 
Blank. In common with the Guilford-Zimmerman, 


4. It permits any number of items to be endorsed, leaving the re- 
spondent free of constraint in this regard. 


Let us look now at an inventory that makes use of the forced-choice 
pattern of response. 


THE SHIPLEY PERSONAL INVENTORY 
The Shipley Personal Inventory consists of a series of forced 
choices. The examinee is confronted with a series of pairs such as: 


A. Worry about financial matters. B. Worry about people not liking 
me. 

A. Upset by the sight of blood. B. Upset by having to meet stran- 
gers. 


He is required to select the item of each pair that is more true of 
him, no matter how appropriate or inappropriate both members of 
the pair may be. When the items were put together in pairs an at- 
tempt was made to equate them for attractiveness or apparent social 
acceptability. At the same time, each pair was made up of items 
differing radically in their diagnostic significance. That is, one item 
in each pair was one that was chosen much more frequently by individuals
known to be poorly adjusted, but this is not obvious to the
examinee who reads the item. We have here another attempt to
escape the conscious faking or unconscious bias of the respondent.
This instrument was developed primarily to serve as a psychiatric
screening device in World War II and has been used almost entirely
with military groups. It exists in a long form of 145 items and in a
short form of 20 items selected from the long form because they were
particularly discriminating. The short form was found to have
about the same reliability as the longer form, about 0.75 to 0.80 for
normal military recruits. The test was moderately effective in identifying
men who were later discharged on the basis of a psychiatric
interview. Results for one group are summarized below. Score is
number of maladjusted responses.


                Psychiatrically    Psychiatric
Score              Approved        Discharges
10 or over            26               22
5-9                  187               21
0-4                  602               16

Total                815               59


An inventory such as this can be used as a preliminary screening 
device to pick out men for more intensive study, but the table above 
indicates that it could hardly be used as a sole basis for rejection. 
If a score of 5 was set as the critical score, it would pick up 43 of 59 
men (73 per cent) who would later be discharged, but this would be 
at the cost of 213 acceptable men. The cost is obviously too high. 
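
The arithmetic behind this judgment can be checked with a short sketch. The counts are those of the table above; the names `RESULTS` and `screen` are our own.

```python
# Counts from the table above:
# score band -> (psychiatrically approved, psychiatric discharges)
RESULTS = {
    "10 or over": (26, 22),
    "5-9": (187, 21),
    "0-4": (602, 16),
}


def screen(bands_flagged, results=RESULTS):
    """Summarize the outcome if every man in the given bands is flagged."""
    total_discharged = sum(d for _, d in results.values())
    caught = sum(results[band][1] for band in bands_flagged)
    false_alarms = sum(results[band][0] for band in bands_flagged)
    return {
        "caught": caught,
        "hit_rate": caught / total_discharged,
        "false_alarms": false_alarms,
        "false_alarms_per_hit": false_alarms / caught,
    }


summary = screen(["10 or over", "5-9"])  # critical score of 5
print(summary["caught"], round(100 * summary["hit_rate"]), summary["false_alarms"])
# prints: 43 73 213
```

With the critical score at 5, roughly five acceptable men are flagged for every man who would later be discharged. Calling `screen(["10 or over"])` shows the other side of the bargain: a critical score of 10 flags only 26 acceptable men but picks up only 22 of the 59 later discharges.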

There are many other temperament and adjustment inventories. 
Some relatively well-known ones are listed in Appendix III, together 
with information about their characteristics and the groups for which 
they were designed. Most of those listed are designed for adolescents 
and adults. For reasons that we shall consider presently, inventories 
seem poorly adapted to use with school children. 


EVALUATION OF TEMPERAMENT AND ADJUSTMENT INVENTORIES 

How well can we hope to describe temperamental characteristics 
and personal adjustment through the individual's responses to a 
series of questions? The main issues to be raised concern the indi- 
vidual's frankness, his self-insight, and his reading ability. 

For personality inventories, frank and honest response by the 
examinee is an essential for a valid picture. In most cases, the 
general significance of the items is reasonably apparent to the reader. 
Most subjects can follow successfully instructions to fake in a par- 
ticular way. Even when the subject cannot fake successfully, if he 
tries to do so he will certainly give a distorted picture of himself. 
Inventory scores will only be useful when most respondents are an- 
swering in the way that they consider to represent themselves. The 
importance of providing protection against distortion is sufficiently 
great so that control scores have been introduced into the MMPI 
and certain other inventories. 

This means that personality or adjustment inventories cannot be 
used, or can be used only with caution, when the examinee feels 
threatened by the test or feels that it may be used against him. 
Inventories have not generally proved useful in an employment 
situation, perhaps for this reason. One is inclined to doubt whether 
good cooperation can be expected from elementary-school pupils in 
the typical school setting. Even in high school and college, coopera- 
tion cannot be universally guaranteed. Generally speaking, in any 
practical situation we should consider an adjustment inventory to 
be no more than a preliminary screening device that will locate a 
group of individuals who may be having problems of adjustment or 
may be in conflict with their environment. Final evaluation should 
always await a more personal and intensive study of the individual. 

A second problem is that of self-insight. The individual may in- 
tend to report about himself frankly and truthfully, but it may be 
impossible to accept his responses at face value because the respon- 
dent does not have a truthful self-picture. In fact, the person whose 
adjustment is most unsatisfactory may be the one who is least able 
to face his own deficiencies. Studies have shown repeatedly that 
those who are rated low by their associates on some desirable trait
tend to grossly over-rate themselves. Thus, the ill-tempered girl
often does not recognize or accept her own irascibility; the overbearing
boy may fail to acknowledge his boorishness.

When inventories are built according to the pattern of the Strong 
and the MMPI, such a lack of self-insight may not matter. The 
significance of an item lies not in its obvious content but in the fact 
that it did actually distinguish between specified groups. If Henry 
marks that he would like to be an architect, he has behaved in the 
way engineers typically behave. The question as to whether engi- 
neers on the one hand or Henry on the other really want to be archi- 
tects is somewhat beside the point. The point is that they have 
both reacted to the question in the same way. On the other hand, 
where the score is taken at face value, as in the Guilford-Zimmerman 
inventory or the Kuder Preference Record, non-insightful responses 
will result in an untrue picture of the individual who makes them. 

A third problem in inventories of all sorts is that of reading load. 
In order to get sufficient scope and sufficient reliability, it usually 
proves necessary to include in the inventory several hundred items 
of some length and, in some cases, involving complexities of vocabulary
and expression. For a poor reader, the labor of reading this
material may prove to be too much. He may be unable really to
comprehend a number of items, or he may give up the attempt and
respond in a superficial or random fashion. (The F scale of the
MMPI was designed to protect against this hazard.) Thus, inventories
are of questionable value for those of low literacy, be they
adults or children.



Evidences of Validity. Those inventories that have been developed 
as measures of adjustment usually show a moderate level of concurrent
validity. That is, they differentiate between groups established
on other grounds as differing in adjustment. Thus, the MMPI
was set up to distinguish between diagnosed pathological groups and 
normals and continues to do so in new groups. Other inventories 
have been tested by their ability to differentiate less extreme groups 
and have stood up fairly well under the test. 

When it comes to predictive validity, the results are less encouraging.
In civilian studies, inventory scores have generally failed
to predict anything much about the future success of the individual
either in school, on the job, or in his personal living. Military experience
with these instruments has been somewhat more promising.
There have been a number of studies showing substantial relation- 
ship between scores based on inventories and the subsequent judg- 
ment resulting from a psychiatric interview. Relationships to sub- 
sequent discharge from the service have also been sufficiently good 
to indicate that an inventory could serve a useful function as a de- 
vice to screen for careful interview those who appeared to be poten- 
tial misfits. 

The Practical Use of Adjustment Inventories. We must now ask
what use should be made of adjustment inventories in and out of
school. In the light of the factors that can distort scores and the
limited validity they have shown as predictors, we must conclude
that they should be used very sparingly. Our feeling is that an adjustment
inventory should be used only as an adjunct to more intensive
psychological services. If facilities are available to permit
intensive study of some of the group by psychologically trained personnel,
an inventory may serve as a means of identifying persons
likely to profit from working with a counselor. However, there is
little that a classroom teacher can do to dig behind and test the
meaning of an inventory score. Accepted uncritically, the score may
prove very misleading. We believe that little useful purpose is
served by giving an adjustment inventory and making the results
available to the teacher, especially the teacher of an elementary-
school child.


ATTITUDE QUESTIONNAIRES 


One further type of self-report inventory deserves brief mention.
This is the attitude questionnaire, which is designed to appraise an
individual's favorableness toward some group, social institution, or
social concept. Attitude questionnaires are primarily research tools,
since appraisals of attitude rarely enter into routine measurement 
operations in the school, clinic, or industrial plant. They are impor- 
tant tools in research studies of factors related to attitude differen- 
tials, the types of experiences that produce changes in attitude, or 
the influence of attitudes on our perception of our world. 

The typical attitude questionnaire is made up of a series of state- 
ments which the individual may either endorse or reject. There are 
two main patterns: 


1. Scaled Statements. In this form, statements are scaled in terms 
of their meaning or significance, on the basis of extensive preliminary 
work. Thus, if we are preparing this type of attitude scale toward 
the United Nations, we start with a large pool of items. They 
may include the following: 


The UN is a strong influence for peace. 

The UN is a waste of time and effort. 

The UN does about as much harm as good. 

The UN is the most important force for good in the world today. 


A corps of judges is assembled and each judge is asked to sort these
statements into a set of piles, each pile representing a different degree
of favorableness toward the UN. The judge is not indicating
his agreement or disagreement with the statement; he is giving his
interpretation of its meaning and significance. Each statement receives
a scale value based on the average of these judgments and an
ambiguity index based upon the spread of the ratings. (The more
the judgments spread out, the more ambiguous the statement is.)
From the pool of items tried out, some 20 are chosen that spread
out over the range of scale values and are relatively unambiguous.
These constitute the attitude scale.
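
A minimal sketch of this scaling step follows. The judges' sortings are invented, as is the fifth statement; the use of eleven piles, of the standard deviation as the ambiguity index, and of 1.5 as the cutoff are all assumptions made for the illustration, since the procedure described above specifies only "the average" and "the spread."

```python
from statistics import mean, pstdev

# Invented data: the pile (1 = least favorable, 11 = most favorable)
# into which each of five hypothetical judges sorted each statement.
JUDGMENTS = {
    "The UN is a strong influence for peace.": [9, 9, 10, 9, 8],
    "The UN is a waste of time and effort.": [2, 1, 2, 1, 2],
    "The UN does about as much harm as good.": [6, 6, 5, 6, 7],
    "The UN is the most important force for good in the world today.":
        [11, 11, 10, 11, 11],
    "The UN means well.": [3, 9, 6, 10, 4],  # hypothetical ambiguous item
}


def scale_statement(piles):
    """Return (scale value, ambiguity index) for one statement."""
    return mean(piles), pstdev(piles)


def select_items(judgments, max_ambiguity=1.5):
    """Keep the statements on which judges agree, ordered by scale value."""
    kept = []
    for text, piles in judgments.items():
        value, ambiguity = scale_statement(piles)
        if ambiguity <= max_ambiguity:
            kept.append((value, text))
    return sorted(kept)


for value, text in select_items(JUDGMENTS):
    print(f"{value:5.1f}  {text}")
```

The statement on which the hypothetical judges disagree widely is dropped, and the four that remain spread from the unfavorable to the favorable end of the scale.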

When this type of attitude scale is administered, the respondent
marks all the statements with which he agrees. His score is the
average of the scale values of the statements he endorses.

2. Summed Score. In the other common format, the basic statements
are much the same, except that neutral statements are avoided.
Each statement is unequivocally either favorable or unfavorable.
The respondent reacts to each statement on a five-point scale, ranging
from strong agreement to strong disagreement. Thus, a section
of a questionnaire in this format might read:


The UN is a strong      Strongly agree. Agree. Uncertain. Disagree.
influence for peace.    Strongly disagree.

The UN will only        Strongly agree. Agree. Uncertain. Disagree.
make trouble.           Strongly disagree.




The questionnaire is scored quite simply by giving 5 points for 
strong endorsement of a favorable statement, 4 points for agreement,
3 points for uncertainty, and so forth. The scoring is reversed
for the unfavorable statements. An individual's raw score is the
sum of his scores for the separate items. The raw score can, of 
course, be converted into a percentile or standard score if this seems 
desirable. 
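
The two scoring rules can be set side by side in a brief sketch. The function names are ours, and the scale values passed to the second function are hypothetical; the point values for the summed score are those given above.

```python
LIKERT = {"strongly agree": 5, "agree": 4, "uncertain": 3,
          "disagree": 2, "strongly disagree": 1}


def summed_score(responses, favorable):
    """Sum the item scores, reversing the scale for unfavorable statements.

    responses: one response label per statement
    favorable: one boolean per statement, True if the statement is favorable
    """
    total = 0
    for label, is_favorable in zip(responses, favorable):
        points = LIKERT[label.lower()]
        # Reversal on a 1-to-5 scale: 5 becomes 1, 4 becomes 2, and so on.
        total += points if is_favorable else 6 - points
    return total


def scaled_statement_score(endorsed_values):
    """Average the scale values of the statements the respondent endorsed."""
    if not endorsed_values:
        raise ValueError("respondent endorsed no statements")
    return sum(endorsed_values) / len(endorsed_values)


# The two sample statements: the first favorable, the second unfavorable.
print(summed_score(["Agree", "Strongly disagree"], [True, False]))  # 4 + 5 = 9
# Hypothetical scale values for three endorsed statements.
print(scaled_statement_score([9.0, 10.8, 6.0]))
```

Strong disagreement with "The UN will only make trouble" earns the same 5 points as strong agreement with a favorable statement, so a high total always means a favorable attitude.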

Both forms of attitude scale are usually found to have satisfactory
reliabilities, typically in the 0.80's. The two types of scales
will yield scores that intercorrelate very highly, and for most practical
purposes there does not seem to be a great deal of choice between
them. The greater simplicity of preparation of a summed-
score type of inventory will commend it to most persons who wish
to use an attitude scale as an aspect of some type of research project.
In either case, the scale will yield only a single general favorableness-unfavorableness
score for an attitude area. Any qualitative
variations within the broad area are blurred. Recent investigations
on attitude scale development have been concerned with identifying
more restricted and more homogeneous subscales within a
larger attitude domain. A series of homogeneous subscales within a
larger attitude area (toward the UN, for example) should permit
mapping out in a more analytic and diagnostic way the profile of an
individual's or a group's attitudes.

The big qualification about attitude scales is that they operate
purely on a verbal level. The individual doesn't do anything to
back up his stated attitude. The scales deal with verbalized attitudes
rather than actions. Of course, an attitude scale is obviously
fakeable. If we recognize that they represent the verbalized attitude
that the individual is willing to express to us and work within that
limitation, attitude scales appear to be a useful research tool or tool
for experimental evaluation of educational objectives lying outside
the domain of knowledges and skills.


SUMMARY AND EVALUATION 


In this chapter we have considered self-report inventories as instruments
for studying personality. An inventory of this sort is
essentially a standard set of interview questions presented in written
form.

The individual's report about himself has one outstanding advantage.
It provides an "inside" view, based on all the individual's
experience with and knowledge about himself. However, self-reports
are limited by the individual's limited


1. Willingness to reveal himself frankly. 
2. Self-insight and self-understanding. 
3. Ability to read the questions with understanding. 


One type of questionnaire that has proven valuable in selection 
and placement is the biographical data blank, in which the individ- 
ual provides factual information about his past history. Scoring 
keys developed for particular jobs have been found to have useful 
validity for several different jobs. 

Interest inventories provide satisfactorily reliable descriptions of 
interest patterns. These patterns persist with a good deal of sta- 
bility, at least after late adolescence, and appear to be significant 
factors for vocational planning. 

The validity of adjustment and temperament inventories is more 
open to question. Inventories of all types can be distorted to some 
extent if the individual is motivated to distort his responses. Thus, 
the integrity of the responses depends upon the motivation of the 
person examined. This depends, in turn, upon the setting in which 
and purposes for which the inventory is used. In school, industrial, 
or military use of adjustment inventories, one suspects that the 
motivations may often favor distorted responses. In any event, in- 
ventories of this type have not generally shown high validity. They 
should be used only with a good deal of circumspection. 

Attitude questionnaires have been developed to score the intensity 
of favorable or unfavorable reaction to some group, institution, or 
issue. Though these represent only verbal expressions of attitude,
they are useful research tools.


REFERENCES 
1. Ellis, A., Recent research with personality inventories, J. consult. Psychol.,
1953, 17, 45-49.

2. Ellis, A., The validity of personality questionnaires, Psychol. Bull., 1946,
43, 385-440.

3. Ellis, A., and H. S. Conrad, The validity of personality inventories in
military practice, Psychol. Bull., 1948, 45, 385-426.

4. Frandsen, A., Interests and general educational development, J. appl.
Psychol., 1947, 31, 57-65.

5. Garry, R., Individual differences in ability to fake vocational interests,
J. appl. Psychol., 1953, 37, 33-37.

6. Ghiselli, E. E., and R. P. Barthol, The validity of personality inventories
in the selection of employees, J. appl. Psychol., 1953, 37, 18-20.

7. Longstaff, H. P., Fakability of the Strong Interest Blank and the Kuder
Preference Record, J. appl. Psychol., 1948, 32, 360-369.

8. Mallinson, G. G., and W. M. Crumrine, An investigation of the stability
of interests of high school students, J. educ. Res., 1952, 45, 360-383.

9. Rosenberg, N., Stability and maturation of Kuder interest patterns
during high school, Educ. psychol. Meas., 1953, 13, 449-458.

10. Strong, E. K., Interest scores while in college of occupations engaged 
in 20 years later, Educ. psychol. Meas., 1951, 11, 335-348. 

11. Strong, E. K., Nineteen-year followup of engineer interests, J. appl. 
Psychol., 1952, 36, 65-74. 

12. Strong, E. K., Permanence of interest scores over 22 years, J. appl. 
Psychol., 1951, 35, 89-91. 

13. Strong, E. K., Vocational interests of men and women, Stanford, Calif., 
Stanford University Press, 1943. 

14. Super, D. E., The Bernreuter Personality Inventory: a review of re- 
search, Psychol. Bull., 1942, 39, 94-125. 


SUGGESTED ADDITIONAL READING 


Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chapters
20 and 21.
Ellis, Albert, Recent research with personality inventories, J. consult. Psychol.,
1953, 17, 45-49.
Ellis, Albert, The validity of personality questionnaires, Psychol. Bull.,
1946, 43, 385-440.
Ferguson, Leonard W., Personality measurement, New York, McGraw-Hill,
1952, Chapters 2-9.
Harsh, C. M., and H. G. Schrickel, Personality development and assessment,
New York, Ronald Press, 1950, Chapters 17 and 18.
Maller, Julius B., Personality tests, in J. McV. Hunt, Editor, Personality
and the behavior disorders, New York, Ronald Press, 1944, Chapter 5.
Super, Donald E., Appraising vocational fitness, New York, Harper, 1949,
Chapters 16-19.


QUESTIONS FOR DISCUSSION 


1. How satisfactory is the method that was used in validating the Strong
Vocational Interest Blank? What limitations do the procedures have? In
what ways should they be checked?

2. What are the relative advantages of the Strong Vocational Interest Blank
and the Kuder Preference Record? Under what circumstances would you
choose to use one and under what circumstances the other?

3. What is the relationship between measures of interest and measures of
ability? What does this suggest as to the ways in which the two types of
tests should be used?

4. Most civilian studies have failed to find interest or adjustment inventories
very useful in personnel selection. What are the reasons for this?

5. Why are most published interest inventories intended for use with secondary-school
pupils, college students, and adults rather than elementary-school
students?

6. What uses could a classroom teacher make of results on the Kuder Interest
Inventory other than in giving vocational and educational guidance?

7. In what ways could a biographical data blank help a teacher in understanding
the pupils in a class? What types of information would be useful to
include on such a blank?

8. What conditions must be met if a self-report inventory is to be filled out
accurately and give meaningful results?

9. How much trust can we place in adjustment inventories given in school
to elementary-school children? What factors limit their value?

10. What important differences do you notice between the Guilford-Zimmerman
Temperament Survey and the Minnesota Multiphasic Personality Inventory?
For what purposes would each be more suitable?

11. What purposes are served by the control scales (L, K, F, ?) on the
Minnesota Multiphasic Personality Inventory? What would be the comparable
issues in personality rating scales? How might one adapt the ideas of control
scales to ratings?

12. What factors limit the usefulness of paper-and-pencil attitude scales?
What other methods might a teacher use to evaluate attitudes?

13. Prepare the rough draft for a brief attitude scale to measure teachers’
attitudes towards objective tests.

14. With what kinds of groups can adjustment inventories be used most
satisfactorily?


Chapter 15 
Projective Tests 


In the last three chapters, we have considered the possibilities of 
studying a person through (1) his observable actions, (2) the impression
he makes on others, and (3) what he tells us about himself.
There is one further avenue of approach that we must now examine.
We may be able to learn about the individual by exploring his world
of fantasy and make-believe. We shall do this by providing him
with relatively indefinite and unstructured stimuli and observing 
how he structures them for us. The various techniques for doing 
this may be collectively identified as expressive and projective tech- 
niques. 

Psychologists have long recognized that the perceiving of even
quite definite stimuli (an accident, a scene staged before a class,
the content of a picture) is dependent upon the individual perceiver.
He sees what he is set to see. The report reflects his readinesses
and predispositions. The vaguer the stimulus, the more opportunity
there is for the individual to project himself into the
report. Projective tests take advantage of this situation. They 
operate with quite unstructured materials: a vague and ambiguous 
picture, an ink blot, a word or two of a sentence, some modeling clay, 
or paper and finger paints. Furthermore, the instructions place very 
little restriction or constraint upon the respondent. Under these 
conditions, there is the greatest diversity of product produced. The 


basic assumption of projective methods is that under these circumstances
the production depends in large measure upon basic personality
factors in the person being tested and that an appropriate
analysis of the productions can reveal that personality structure.
The basic procedure common to projective techniques is, then,


1. to present the subject with a series of fluid, weakly structured
stimuli

2. under instructions that emphasize freedom of response, and

3. to analyze his productions for insight into his basic personality
dynamics.


For at least some of the projective media, materials and pro- 
cedures have been standardized and are widely used. In addition, 
there are a host of exploratory and unstandardized projective media. 
We shall describe in some detail the two that are probably most 
extensively used, mention briefly several others, and then try to
apply to projective tests the same criteria of evaluation that we 
have used with other measurement procedures. 


THE RORSCHACH TEST 


The Rorschach Test has been so widely publicized now that it is 
probably familiar in a general way to most readers of this book. 
The basic material is 10 ink blots, nonsense patterns produced by 
putting blobs of ink on a piece of paper and folding the paper over 
so the two halves blot. But these are not just ordinary ink blots. 
They were selected by Hermann Rorschach, the original investigator,
after trying out thousands of different blots with patients in
mental institutions, because they appeared particularly effective in
eliciting a richness of diagnostic material.

Sample ink blots, like those in the Rorschach Test, are shown in 
Fig. 15.1. These blots are entirely black and white, as are five of 
the blots in the Rorschach series. Two Rorschach blots contain bright 
red blotches in addition to the black and white, while three are made 
up only of colored patches of various hues. The symmetrical blots
are mounted in the center of white cards and may be turned and
viewed from any angle. The cards are presented to the subject one
at a time in a specified order. The order of presentation is
considered important because the subject's reaction to the sudden
appearance of color is thought to be a significant element in his
reaction to the test materials.


The test is introduced to the subject in a rather ambiguous way:
"People see all sorts of different things in these cards. I'd like you
to tell me what you see." When the subject is seated, the examiner
hands him card I with instructions: "Tell me what you see? What
might this be?" The subject is allowed as much time as he wants
for a given card and is permitted to give as many responses to it as
he wishes. He is also allowed to turn the card around and look at
it from any angle to find things in it. However, he is not instructed
to look for many items and is not told that he may rotate the card.




Fig. 15.1. Rorschach type ink blots.


Instructions are kept to the barest minimum, this presumably mak- 
ing the performance depend more completely on the person being 
examined. 

During the initial presentation, the examiner keeps a record of the
time between presentation of each card and the initial response to it.
He records each response as it is given and the position of the card
when the response is seen. Notes are also made of any significant
behavior by the subject during testing, i.e., evidences of upset,
rejection of a card, etc.




After the initial presentation of the cards, the examiner goes over 
them with the subject again, questioning him about his responses. 
The questioning helps the examiner to find out where each item 
was seen in the blot and what aspect of the blot (form, color, etc.) 
primarily determined what the subject saw. It gives an opportunity 
for further clarification of anything that may have been obscure in 
the subject's initial response. Notes are made as needed and become 
part of the raw test record. 


SCORING A RORSCHACH RECORD 

The raw Rorschach record contains a mass of diverse material, 
and procedures of analysis must be applied to bring some order out 
of the chaos of details. Several different scoring schemes for the 
Rorschach have been developed. The scoring procedure described 
here is the one developed by Klopfer. Only the major parts of the 
scoring procedure are discussed in this section.

Rorschach and his followers have identified a number of different
categories of response which are thought to have diagnostic
significance. In addition to the simple count of number of responses,
three main aspects of each response are considered important. These
are location, determinant, and content.

In general, location is scored by determining the area of the blot
to which the response corresponded. The subject can use the whole
blot (W) for his response, as when he calls blot 1 of Fig. 15.1 a
"crab's shell." He may base his response on a large subdivision of
the blot (D), as when he reports each half of blot 2 to be "an
Indian's head." He may base his response on some small usual detail
(d), as when the upper center part of blot 2 is seen as two witches
talking.

The determinant refers to the characteristic of the blot that
caused the subject to see it as he did. The principal determinants
are the shape or form of the blot (F), color (C), movement (M),
and shading (K). As a rule, the greatest number of responses in a
record are elicited by the shape of the blot. A further coding may
be assigned to form responses depending upon whether the form of
the blot appears to fit the response especially well (F+) or quite
poorly (F-). It is quite common to have a response based jointly
on two determinants. Thus, a response may depend upon both the
shape and color of the blot, or the shape and shading of the blot.
Thus, a subject may see blot 3 as a reflection of a bear in the water.
The response would be coded for both determinants, the dominant
one being listed first (i.e., if this response was made primarily
because of the shape of the blot, it would be coded FK, F for form and
K for the depth or vista response).

The content categories refer to what it is that is seen in the blot. 
Among the categories used are human beings, animals, parts of hu- 
man beings, parts of animals, nature, and inanimate objects. Re- 
current content themes are particularly noted, as are content ele- 
ments that appear to tell a story.

In addition to the three aspects of analysis noted above, each 
response is also classified as a frequently occurring or popular re- 
sponse (P) or as a rarely given or original response (O). There are 
a number of additional categories which are noted in the analysis of 
the record, i.e., use of white space, rare edge details, etc. It is not 
possible or desirable to try to indicate the complete scoring pro- 
cedure here. 

After the single responses have been coded, a summary tabulation 
is prepared for the record. The frequency of each category is deter- 
mined, and a number of ratios between different categories are cal- 
culated. It is this quantitative summary, plus the qualitative notes 
on the subject's reactions, on which interpretation is primarily based.
The single response has significance only as it becomes part of this
total.
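
The summary tabulation described above can be pictured with a small sketch. The responses listed here are invented for illustration, and the single proportion computed is only a stand-in for the ratios a scoring system would actually prescribe; nothing below reproduces Klopfer's procedure.

```python
from collections import Counter

# Hypothetical coded responses: (location, determinant, popular/original flag).
# The symbols are those named in the text (W, D, d; F, C, M, K; P, O).
responses = [
    ("W", "F", "P"),
    ("D", "F", None),
    ("D", "FK", None),
    ("d", "M", "O"),
    ("W", "C", None),
    ("D", "F", "P"),
]


def tabulate(coded):
    """Count how often each location, determinant, and P/O flag occurs."""
    locations = Counter(loc for loc, _, _ in coded)
    determinants = Counter(det for _, det, _ in coded)
    flags = Counter(flag for _, _, flag in coded if flag is not None)
    return locations, determinants, flags


def proportion(counts, key):
    """Share of all tallied responses that fall in one category."""
    total = sum(counts.values())
    return counts[key] / total if total else 0.0


locations, determinants, flags = tabulate(responses)
# Illustrative ratio only: the share of responses that use the whole blot.
whole_share = proportion(locations, "W")
```

The point of the sketch is that no single coded response is interpreted; only the frequencies and the relations among them enter the summary.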


INTERPRETING THE RORSCHACH RECORD 

Rorschach specialists would agree that the heart of the Rorschach
method is the final integrative synthesis of the material that results
from scoring the test. This is also the most difficult part of the
undertaking, calling as it does for the evaluation and synthesis of
many separate cues. Writers about the Rorschach insist that adequate
interpretations can only be made by persons who have both a
broad psychological background and extensive experience with the
instrument itself. Just how much training is required would probably
be a matter of debate, but clearly the interpretation of a record
is not something to be undertaken by the teacher, the usual guidance
worker, or many psychologists without the required special
training. Any abbreviated presentation of the manner of interpretation
must necessarily represent an oversimplification and do the
method some injustice.

The Rorschach record is considered by its exponents to provide
information about the whole functioning personality. The types of
questions we may ask and for which we may expect help are like the
following: How does this person usually attack a problem? Does he
characteristically first look at the problem as a whole and then
break it down into component parts or does he build up the total
solution from its main parts? Does he deal with the main features 
of a problem or does he bog down in details and never reach the main 
problem? Does he approach problems in a rigid, set manner, or is 
his approach flexible? What is his intellectual level and how effec- 
tively does he use his intelligence? Is he overly ambitious? How 
does he handle his emotions? 

Factors that are considered to be associated with the intellectual 
level of the subject are clearness of perception of form, ability to 
organize the blot into forms using the whole of the blot, number of 
original responses, total number of responses, and variety of content. 

The location of the response in the blot and the approach in
responding to each blot are said to represent the individual's way of
solving problems. Using the whole blot is associated with intellectual
ambition, and a person who is striving beyond his ability is
expected to produce poor whole responses (i.e., to force a high level
of organization even where it is not appropriate). Breaking the blot
into small, unusual details seems to be characteristic of compulsive
people who insist that the response must exactly match the form of
the blot. The common-sense approach is illustrated by the frequent
use of D, or usual details. Exponents of the Rorschach consider it
to be particularly effective in revealing how well the individual uses
his intelligence. However, estimates of intelligence based on the
Rorschach have shown only rather modest correlations with scores on
the conventional intelligence test.

The subject's use of color, texture, and shading are thought to 
give evidence about his emotional life. Pure color or color naming 
responses are considered to indicate a lack of emotional control. 
When color is combined with form but the form predominates, it is 
taken to indicate that the individual has a lively emotional life but
that he has control of his emotions. Texture and shading responses
are usually interpreted as indications of anxiety, feelings of
inadequacy, or depression. Vista or three-dimensional responses receive
much the same interpretation.

The movement (M) responses are associated with the inner life of
the individual. Rorschach expressed the belief that movement
responses represent a strongly felt wish experience. Many interpreters
feel that M is a correlate of the color (C) response and shows
internalized emotion whereas the C response represents the
externalized emotional reaction.

The content of the Rorschach responses provides cues to the cause
for other types of responses. In the different content of the
responses, the subject reveals his different personal experiences. It is
from the content or symbolism of the content that the analogy is 
made between a Rorschach record and a dream. 


THE THEMATIC APPERCEPTION TEST 


The Thematic Apperception Test, usually referred to as the TAT,
was originally described by Murray and Morgan in 1935. The
basic material of the TAT is a set of pictures, 30 in all, each rather
vague and indefinite, showing one or two human figures in different
poses and actions. Some of the pictures are specifically for boys,
some specifically for girls, some for males over 14, some for females
over 14, and some for all groups. For a particular age and sex, there
are 20 of the pictures that are supposed to be used, though in many
cases the examiner limits himself to a smaller number of pictures
that he considers particularly appropriate for his subject. A sample
picture from Murray's Thematic Apperception Test is shown in Fig.
15.2. This particular picture is one used in the series for women.

Fig. 15.2. Sample picture from Murray Thematic Apperception Test.

The subject's basic task is to tell a story based on the picture.
Before any of the pictures are shown to the subject, the examiner
instructs him somewhat as follows: “I am going to show you some
pictures. I want you to tell me a story about what is going on in 
each picture. What led up to it, and what will the outcome be?" 
The exact instructions may vary from time to time, but they always 
include the directions to produce a setting for the action in the 
picture and to indicate the outcome. The story told by the subject 
is recorded verbatim, either by the examiner or with a recording 
machine. There are no time limits and no limits on the length of 
the story. The example given below is a sample of the responses 
given to the picture in Fig. 15.2. 

This young girl wants to go out on her own and lead a good life, but 
this old woman wants to control her and make her do things as the old 
woman wants them done. Some of the things the old woman has told 
the girl were bad, but the girl had to do them anyway. She hates the old 
hag and gets tired of the control that the old woman has over her and 
kills the old hag. No one ever found out that the girl killed the woman
so she is free to do what she wants.


Points that should probably be noted in the above story are sub- 
mission to another, an unwillingness to assume responsibility for 
personal behavior, hostility, and a socially unacceptable method of 
solving the problem situation. It must be remembered that a single 
story is not especially significant in understanding the person. 
However, if these elements recurred in a number of stories, then 
the pattern would have much greater significance. 


SCORING AND INTERPRETATION OF THE TAT 

A number of different scoring schemes have been worked out for 
the TAT. Most of these are elaborate and time-consuming. One 
thing that all the scoring schemes have in common is that the con- 
tent of the stories plays a central role in interpreting the record. 
This contrasts with the Rorschach, where the center of attention is 
not what the subject sees but how he sees it. Beyond this, there is 
little uniformity in procedure for analysis, the method of
interpretation and aspects analyzed depending upon the original purpose
of giving the test.

Originally Murray analyzed the stories according to needs and
presses, the needs of the hero and the environmental forces (presses)
to which he is exposed. Each story was analyzed; from the total set
of stories each need and press received a weighted score; and the
needs and presses were then arranged in rank order. At the same
time, the relationships between the needs were investigated.
Although this type of analysis appeared to yield a wealth of data, it is
not generally followed today. Mastery of the need concept is difficult
to achieve, and the analysis is quite laborious, requiring about
5 hours to interpret a set of 20 stories, on the average.
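
The scoring and rank ordering of needs and presses just described might be sketched as follows. The variable names, the strength ratings, and the rule of simply summing ratings across stories are all assumptions made for illustration; the text does not give Murray's actual weighting scheme.

```python
from collections import defaultdict

# Hypothetical ratings: for each story, the strength assigned to every
# need or press judged to be present (scale and labels are invented).
stories = [
    {"autonomy": 4, "aggression": 3},
    {"autonomy": 2, "dominance (press)": 5},
    {"aggression": 1, "dominance (press)": 2},
]


def rank_variables(rated_stories):
    """Sum each variable's ratings over all stories, then rank high to low."""
    totals = defaultdict(int)
    for ratings in rated_stories:
        for name, strength in ratings.items():
            totals[name] += strength
    # Ties are broken alphabetically so the ordering is reproducible.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_variables(stories)
```

Even in this toy form, every story must be rated on every variable before any ranking is possible, which suggests why the full procedure was found so laborious.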

Most currently used scoring systems take account of the following: 


1. The style of the story, including such factors as length of story, 
language used, originality, variation of content, and organizational 
qualities. 

2. Recurring themes in the story: such themes as retribution, 
struggle and failure, parental domination, etc. 

3. Relation of the outcome of the story to the rest of the plot. 

4. Primary and secondary identification, the choice of hero for 
the story and person second in importance. 

5. The handling of authority figures and sex relationships. 


Whatever method of interpretation is used, it is recognized that 
the single response has significance only as an element in the total
pattern. It is the recurring themes and features that are important
for interpretation.


OTHER PROJECTIVE TECHNIQUES 

During the past 20 years, a host of other projective procedures
have been proposed and have been developed to a greater or lesser
degree. A number of these bear a close resemblance to the TAT.
The Four-Picture Test of Van Lennep requires the subject to use
four vague water-color pictures involving persons in different
grouping and relationships in composing a story. This is alleged to bring
out the subject's attitude toward life. The Schneidman Make a Picture
Story Test (MAPS) requires the subject to make his own picture
and corresponding story from a set of 67 cardboard figures
presented one at a time by the examiner. The more active participation
by the subject is reported to result in longer and richer stories
than the TAT, but the unstandardized nature of the task makes it
difficult to arrive at any norms or any standard way of scoring the
product. Symonds has prepared the Picture Story Test, a set of
pictures involving adolescent characters, designed for use with
adolescents. Special sets of pictures have also been prepared for use with
children and for use with Negroes, using figures with which the
respondents can more readily identify themselves.

Graphic and plastic art materials have also been used to provide
the raw material for projective analyses. Children's painting and
finger painting and various types of clay modeling have provided
unstructured media into which the child could project himself. Doll
play also has provided an opportunity for dramatic expression for 
young children. The child is provided a set of dolls representing 
the various members of a family constellation and is given the ma- 
terials with which to construct a stage setting. He is encouraged to 
act out any type of story or scene that appeals to him. Acting out 
problem situations in the make-believe setting is used not only as a 
source of information about the child's problem, but also as a form 
of therapy through which the child is provided with an opportunity 
to express, and presumably eventually relieve, his anxieties and 
tensions. 

Verbal materials have also been used to some extent as media to 
elicit the individual's projections. The classical form of verbal pro- 
jective test is the word association test, in which stimulus words are 
read one after another and the subject responds to each with the first
word he thinks of. Cues to problem areas are obtained from words 
to which the subject responds very slowly, words on which he blocks 
and makes no response, and words to which he responds with unusual 
associations. The word association procedure is not widely used at 
the present time because it does not appear to provide any very 
rich insights into the person being studied. 

Sentence completion is another form of verbal test that has re- 
ceived some attention since 1940. The subject is given a series of 
incomplete sentences. The beginning of each sentence is provided,
and the subject is to go through the list quickly, writing an ending 
for each sentence. The sentences may be in first person or third 
person, very unstructured or quite complete. Illustrations of the
sort of materials used are the following:

I wish ______
John felt ______
When I am alone I ______
When Mary's mother left she ______


The sentence completion test is usually analyzed for the content
of the responses in much the same way as the TAT and gives
indications of feelings, attitudes, and reactions to things and people
rather than indicating underlying personality structure and
dynamics. The ease with which it can be administered to a group
makes it attractive, but the verbal production required limits its
use to fairly literate individuals. The nature of the response that
he is making is fairly apparent to the subject, and it is relatively
easy for him not to reveal himself on the test if he does not choose to.




THE ESSENTIAL NATURE AND PRESUMED 
ADVANTAGES OF PROJECTIVE TESTS 


We have seen something of the diversity of projective methods 
and of the wide range of materials and media they use. Now we 
must ask what the common core running through them is and deter- 
mine what advantages they may claim over other methods of study- 
ing the individual. Four points will be noted.

In the first place, the tasks presented to the individual are usually 
both somewhat novel and quite unstructured. The subject cannot 
depend upon established, conventional, and stereotyped patterns of 
response. Rather, he is thrown back upon himself and must delve 
within himself for the response. He must create it anew in the test 
situation.

In the second place, the nature of the appraisals being made is 
usually well disguised. The subject is ordinarily not aware of the 
true purpose of the test, and even if he does have a general idea of 
the nature of the appraisal he does not know what aspects of his 
response are significant or what significance they have. The test is 
usually given under a neutral guise as one of imagination or artistic
ability. The individual is not called upon to verbalize his anxieties
or emotions or to reveal himself directly and consciously to the
tester. What revelation occurs is largely indirect and outside the
subject's awareness. Thus, inhibitions and conscious controls may
be by-passed, and intentional distortion of the picture presented is
difficult.

Third, most of the tests make little or no demand on literacy or
academic skills. They are non-reading, largely independent of any
particular language, and in some cases do not involve speech at all.
This extends greatly their scope of usefulness. They may be used
with children, even quite young children below school age. They
may be used with illiterates or non-English speaking. They may be
used in different cultures. Thus, their scope is much wider than
self-report or rating procedures.


Fourth, they provide a view of the total functioning individual.
They do not slice off one piece or trait for analysis. They preserve,
it is alleged, the unity and integration of the total personality. To
the clinically oriented user this appears a great virtue. In practical
work, one must deal with the whole person, not just his limited
intelligence, or his lack of emotional control, or his strong
identification with his father. There is an appeal to a test that aspires
to appraise the individual as a total functioning unity. Whether
this is, in fact, the best way to understand him is another matter. 
We may be seduced by the shibboleth of "the whole child" into 
vagueness and fuzziness that results in a poorer picture of the whole 
than if we had looked more analytically and carefully at one aspect 
at a time. In buying a house, there are considerations of basic con- 
struction, type of roof, size of rooms, quality and adequacy of heat- 
ing plant, plumbing and electrical equipment, accessibility and qual- 
ity of schools and shopping areas, character of the neighborhood, 
esthetic appeal of the structure, cost, and many others. A more ra- 
tional choice of house could probably be made by identifying these 
components and considering them one at a time than by reacting to 
the proposition as a totality. In the same way, it is possible that 
the total individual may be appraised more accurately and a truer
description of him prepared if we concentrate our information-
gathering upon one aspect at a time. It is, in any event, a fundamental
issue and one on which no agreement is currently available
as to how analytical an approach will provide the best basis for view- 
ing the whole. 


EVALUATION OF PROJECTIVE METHODS 


We must now attempt to evaluate projective methods in terms of 
the criteria which we set forth in Chapter 6, attempting to see how 
they meet our requirements of validity, reliability, and practicality. 
Most of our data and illustrations will refer to the Rorschach, since 
this test has been studied longer and much more intensively than 
the others. The general problems we encounter will apply pretty 
generally to all. However, information on many of the varied tech- 
niques is fragmentary in the extreme. 


VALIDITY OF PROJECTIVE TECHNIQUES 


Determining the criteria by which to appraise the validity of the
results from using projective techniques is a very troublesome
problem. Some of the techniques at least, for example, the Rorschach,
undertake to provide the basis for a rather full description of the
functioning person. How shall we determine whether this is a true
picture? You will remember that in Chapter 6 we discussed five
sorts of validity: (1) content validity, (2) construct validity, (3)
congruent validity, (4) concurrent validity, and (5) predictive validity.
Let us consider projective methods from each of these five
viewpoints and see what may be said.



Content Validity. The judgment of content validity depends upon 
a judgment of the correspondence between the overt content of a 
test and the qualities one desires to measure. The overt content of 
projective tests bears little or no direct relationship to the inferences 
that are made from the tests. Thus, there is no discernible relation- 
ship between the composition of a set of ink blots and inferences 
about the effective intelligence or emotional control of a person. 
Content validity appears to be irrelevant as a basis for appraising 
the validity of a projective test. 

Construct Validity. The term construct validity was used to refer
to the adequacy of analysis of some broad construct, such as
scientific thinking, into specific test tasks. Here again, in the case of
projective tests the lack of correspondence between the overt tasks
presented to the subject and the inferences one wishes to make renders
this type of rational analysis and matching of task to inference a
very devious and hazardous procedure. One can to some extent
rationalize certain behaviors in perceiving blots, in telling stories, or in
painting pictures as corresponding to personality trends in the
subject. But the analysis results in rather remote and crude analogies
(i.e., movement does not exist in a blot; therefore seeing it is a
creative act), and these cannot serve as demonstrations of their own
correctness.

Congruent Validity. We said that a measure shows congruent
validity when it corresponds to an accepted evaluation of the function
or complex that it is supposed to appraise, or when it is affected
in a predicted manner by changed experimental conditions. Much
of the existing evidence for the validity of projective tests falls into 
one or the other of these two patterns. Let us consider first evidence 
from the correspondence of projective tests with other sorts of evalua- 
tions. 

Evaluating projective tests by trying to relate them to other sorts
of appraisals is a somewhat unsatisfactory undertaking at best.
There are several problems. The first is to find some other
satisfactory evaluation of the total functioning personality. Analytical
trait scores do not seem acceptable. A descriptive personality sketch
prepared either by the person himself or by his associates is probably
too biased and superficial to serve as an acceptable criterion
for evaluating the test. Probably the best outside evaluation is the
personality description based upon extensive study of the individual 
by psychiatrist, clinically trained psychologist, or social worker. 

A second problem is what it is that is being validated. In this
setting, we are probably primarily interested in validating the final
descriptive picture that results from the test, rather than any
quantitative sign or score. If this is true, it must be recognized that it is
inevitable that we validate the test-and-interpreter as a unit. Fail- 
ure to find a procedure valid may be due to lack of skill on the part 
of the interpreter, and conversely the finding that a certain especially 
skilled interpreter can arrive at valid descriptions on the basis of the 
test is no guarantee that the test will maintain its validity in less 
skilled hands. 

Finally, we face the problem of suitable experimental and statis- 
tical operations for evaluating validity. The conventional correla- 
tion coefficient breaks down when the problem is one of assessing 
the validity of a multidimensional personality description. The ex- 
pedient that has been resorted to in a number of investigations is 
that of matching. Thus, case-study materials are provided for several
individuals, perhaps five or six, and projective test records or
interpretations are supplied for the same persons. Judges are given
the task of matching the two sets of materials, i.e., of saying which 
case record goes with which projective test record or interpretation. 
If matching better than chance is achieved, it indicates at least 
that there is something in the projective record that can be identi- 
fied with something in the case material. The matching procedure 
does not, of course, tell what the significant cues were in either the 
test data or the case record. 

Published studies that have attempted the type of matching de- 
scribed above have not been numerous. However, in certain in- 
stances at least matching has been quite successful. Thus, Krugman
reports a study in which personality descriptions based on
the Rorschach and abstracts of the clinical charts of 25 cases of 
problem children were matched in sets of five by five different judges. 
On the average, each judge achieved correct matching in 84 per 
cent of the cases, as compared with the 20 per cent that could be 
expected by chance.* (It should perhaps be pointed out that the 
group in which this matching was achieved ranged in age from 5 to 
18 and in intelligence from borderline to very superior and that they 
were all sufficiently unusual to be problem cases.) 
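
The 20 per cent chance figure for matching in sets of five can be checked by brute enumeration. The sketch below assumes that a guessing judge pairs each of the five descriptions with a different chart, every such one-to-one pairing being equally likely; the function name is ours, not the study's.

```python
from fractions import Fraction
from itertools import permutations


def chance_match_rate(n):
    """Expected fraction of correct pairings when n records are matched at random."""
    total_correct = 0
    total_arrangements = 0
    for arrangement in permutations(range(n)):
        # Record i is correctly matched when it lands in position i.
        total_correct += sum(1 for i, j in enumerate(arrangement) if i == j)
        total_arrangements += 1
    return Fraction(total_correct, total_arrangements * n)


rate_for_five = chance_match_rate(5)
```

Under this assumption a guesser gets, on the average, one pairing right per set whatever the size of the set, so the chance rate for sets of five is one in five.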

The other type of approach to congruent validity is through
systematic introduction of an experimental variable and observation of
whether the expected relationships with or effects upon test
performance are obtained. Thus, Williams studied the decrement in
performance on a substitution task under stress conditions and found
a correlation of -0.72 with a Rorschach index presumed to measure
intellectual control (F+ per cent on colored cards). On the other
hand, Eichler studied the effect of introducing stress (electric
shocks) during a test task upon 15 different anxiety indicators of
the Rorschach. Of the 15, four showed significant difference in the
expected direction (in comparison with a control group), one a
significant and three a near-significant difference in the reverse
direction. Lazarus investigated "color shock" on the Rorschach (delay,
blocking, and apparent disturbance when faced with the colored
cards) by comparing the standard Rorschach with an achromatic
version of the cards. The indices of color shock appeared about 
equally on the colored and achromatic series. These examples serve 
as illustrations that sometimes scores on the test do correspond in 
the expected way with outside variables, but in many other instances 
they do not.* Experimental verification of the components of Ror- 
schach doctrine is far from complete. 

Concurrent Validity. We used the term concurrent validity to 
refer to a measure's ability to differentiate groups that were known 
to be different on some basis. It is this type of analysis that has 
been relied upon in large measure both as the means of establishing 
the scoring procedures and as the test of the effectiveness of
projective methods. Groups with different psychiatric diagnoses or
clinic groups and presumed "normal" groups are compared either in
terms of individual scores or signs or in terms of the total
description of the individual. That is, one may carry out statistical
tests of single indicators, or one may attempt a blind assignment of
cases to diagnostic categories on the basis of the total record.

Much of the matching up of signs and descriptions with diagnoses
has been intuitive and descriptive and has been carried out without
the experimental controls that would make the work suitable as a
test of the hypotheses relating to the instrument. However, a
certain number of adequately controlled studies have been carried out.
Benjamin and Ebaugh 2 report agreement in 85 per cent of 46 cases
in which blind diagnoses were arrived at on the basis of Rorschach
alone and checked against independent diagnoses by a psychiatrist.
Siegel 17 reported 62 per cent agreement on an initial test and 88 per
cent agreement on a retest a year later between Rorschach and
psychiatric diagnosis for 26 children referred to a child guidance
clinic. Thus, in at least certain groups representing fairly extreme
deviations from normality, the Rorschach has identified diagnostic
categories with substantial accuracy.

However, positive results are not uniformly obtained. In one
careful study, two experienced Rorschach analysts tried to sort out
the records of 60 neurotic boys from those of a control group that
matched the neurotics for age and intelligence. Classifying their
results on a two-way split, one judge was right 63 per cent of the
time and the other judge 48 per cent of the time. Flipping a coin
should yield 50 per cent correct.
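
How far a judge's percentage of correct sortings departs from coin-flipping depends on the number of records sorted, which the passage does not state outright. The following sketch is illustrative only: it assumes, purely for the sake of the example, 120 records (the 60 neurotic boys plus a control group of equal size), and the function name is ours. It computes the probability that a coin-flipping judge would do at least as well as each reported percentage.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """Probability of k or more successes in n independent trials,
    each with success probability p (exact binomial sum)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# ASSUMPTION: 120 records in all; the source gives only the 60 neurotic
# boys and says the control group was matched to them.
n_records = 120

for per_cent_right in (63, 48):
    n_right = round(n_records * per_cent_right / 100)
    chance_prob = binomial_upper_tail(n_right, n_records)
    print(per_cent_right, n_right, chance_prob)
```

With a different number of records the probabilities change, so the figures printed here should be read as an illustration of the reasoning and not as a reanalysis of the study.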

Predictive Validity. A number of studies have attempted to predict
from projective test data some future event, such as continuation
in or improvement from therapy, academic achievement, or
job success. In isolated instances, correlations of some magnitude 
have been reported, but other workers have generally found it im- 
possible to reproduce these results. Thus, Munroe 13 reported a 
correlation of -0.49 between number of signs of maladjustment
derived from an inspection technique of scoring the Rorschach and
freshman grades at Sarah Lawrence College. However, Cronbach 3
was unable to achieve any useful prediction with University of Chi- 
cago freshmen, and it may be noted that Munroe has not reported 
confirmation of her original results with new groups. Holt and 
Luborsky 7 reported preliminary results indicating correlations above 
0.60 between scores derived from the TAT and a rating of psycho- 
therapeutic ability in psychiatric candidates but found that the cor- 
relation dropped to approximately zero when more data were gath- 
ered. In general, we must conclude that at the present time we lack 
verified evidence that any projective test enables us to predict any- 
thing of importance about the individual. 


RELIABILITY


Arriving at a satisfactory basis for appraising the reliability of
projective methods has also proved to be a tricky business. Separate
equivalent or near-equivalent forms of the tests have rarely been
prepared. The devotees of the tests protest that it is not possible
to divide the test into equivalent halves. Memory of a previous
testing is likely to distort retesting over a short time. And the test
enthusiast is likely to protest that the total personality is changing
from day to day, so that test results cannot be expected to remain
stable over a period of time. One senses a certain flight from reality
in all of this, a well-formed mechanism of defense. The tough-minded
psychometrician would like to know what magic there is
about one set of ink blots or one set of pictures that makes them
irreplaceable. Why should it be essentially more difficult to produce
a parallel set of blots than a parallel set of intelligence test items?
He would also contend that aspects of personality so fleeting that
they cannot legitimately be appraised by a retest after some lapse
of time are probably also so superficial that they are of no importance.



But real problems in evaluating the reliability of projective 
methods do remain. These center around the question of what it is
whose reliability is being appraised. Is it the reliability of some
relatively objective component score, or is it the reliability of some
inference from the test materials, or is it the reliability of the total
descriptive picture? Probably each of these is worth studying. 
The last, the reliability of the descriptive picture, does not fit into 
the usual statistics of reliability, and one must fall back on matching 
or similar techniques. Let us inquire into reliability at each of the 
several levels. 

Reliability of Single Score Components. Evidence on the reliability 
of single score components is available chiefly for the Rorschach. A
number of split-half and retest reliability studies are available. 
Values reported differ from study to study and for the different 
types of subscores. However, reliabilities are quite uniformly posi- 
tive and, in many instances, quite substantial. One would conclude 
that the major score components show a within-test consistency 
that is comparable with that of other personality measures. 

Perhaps the most satisfying evidence of reliability is correlation of 
two comparable or near-comparable forms. A separate set of blots, 
the Behn-Rorschach, was designed to parallel the original Rorschach 
Test. Correlations between the two were determined by Eichler 4.
For total number of responses the correlation was about 0.70.
Separate scoring categories showed correlations of from 0.45 to 0.70.
The values were not greatly different when a retest was given with
the original Rorschach. The relationships were all significantly
positive but lower than we have come to expect with other test
materials, particularly tests that are to be used for detailed individual
diagnosis. Comparable figures were obtained by Meadows 12 for the
main score components. However, Meadows found many of the
rarer and subtler scores to give reliability coefficients ranging from
0.40 down to zero.

Reliability of Predictive Inferences. It is possible that the single
components of a projective test record may be of rather modest
reliability, and yet that consistent inferences may be drawn from the
total record. If we work with a single sample of the subject's
behavior and two or more interpreters, we are studying only the
reliability of the interpretation. If we provide the independent
interpreters independent test records, then we are testing the reliability
of the test-interpreter combination. Few studies of the reliability
of predictions from projective tests have been made. One related
study by Palmer 15 had trained interpreters make judgments about
the persons whose Rorschach records they studied on a check list of
different attributes. Palmer found very low correlations between
judges for judgments of this sort based on the same set of records.
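
The reliability figures quoted in this section, whether for parallel forms, for a retest, or for agreement between judges, are correlations between two sets of scores for the same persons. As a minimal sketch of the arithmetic, the product-moment correlation can be computed as below. The function name and the scores are hypothetical and are not taken from any of the studies cited.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL data: number of responses given by eight persons to the
# original blots and to a parallel set of blots.
form_a = [18, 25, 31, 22, 40, 15, 27, 33]
form_b = [20, 21, 35, 19, 36, 18, 30, 28]
print(round(pearson_r(form_a, form_b), 2))
```

The same function applies when the two lists are the ratings of two judges on one set of records, in which case the coefficient reflects the reliability of the interpretation and not of the test itself.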

Reliability of Personality Descriptions. How accurately does the 
total picture given by a projective test maintain itself from one 
testing to another, or from one interpretation to another? Such
evidence as we have presented on the extent to which Rorschach 
records can be matched with case materials or with diagnostic cate- 
gories is indirect evidence of reliability. There must be some sta- 
bility in the basic Rorschach record if it can be dependably matched 
with anything else. What level of precision the descriptive picture 
achieves, however, is almost impossible to determine from this sort 
of evidence. 


PRACTICALITY 

Projective tests are viewed by their proponents as clinical tech- 
niques that can be expected to give valid results only in the hands 
of persons having both special training in the technique and a high 
level of general sophistication in dynamic psychology. Furthermore,
the tests are generally time-consuming both to give and to score. It
seems clear, therefore, that whatever use may be justified by the
validity that they demonstrate will be limited to mental hygiene
clinics, mental institutions, private clinical practice, and similar
settings in which adequate resources are available. They are not, and
probably never will be, techniques to be widely applied in schools.
The teacher and the school administrator are interested in these
approaches only as consumers. They may have occasion to hear
some test or the interpretation of a test discussed in connection with
a particular child. Their need is to know something of what the
clinician hopes to be able to do with the test and to have some sense
of the level of confidence to be placed in the results. It is hard to
decide, at the present state of our knowledge, what the answer is
on this last point. Certainly, a substantial admixture of scepticism
seems to be indicated. But clearly, these procedures are for
specialists in special situations, and the story of the help that they can
provide even then is far from complete.


SUMMARY STATEMENT 


During the past 20 years, the invention, exploration, and
development of projective tests of personality has been for many
psychologists the most exciting adventure in personality evaluation.
The tests have many ardent supporters and many severe critics. A
sound appraisal of their contribution to our understanding of the
individual is difficult to arrive at at the present time. Both claims
and results are conflicting.

A great many of the procedures have received very little by way
of rigorous and critical test and are supported only by the faith and 
enthusiasm of their backers. In those few cases, most notably that 
of the Rorschach, where a good deal of critical work has been done, 
results are varied and there is much inconsistency in the research 
picture. Modest reliability is usually found, but consistent evidence 
of validity is harder to come by. 

In any event, these techniques are the tools of the trained special- 
ist. They are not likely ever to become part of a general testing 
program. The specialist must appraise them with a critical thor- 
oughness that is not possible within the limits of this single chapter. 


REFERENCES 


1. Beck, S. J., Rorschach's test: Vol. 1, Basic processes, New York, Grune
& Stratton, 1944.

2. Benjamin, J. D., and F. G. Ebaugh, The diagnostic validity of the
Rorschach test, Amer. J. Psychiat., 1938, 94, 1163-1178.

3. Cronbach, L. J., Studies of the Group Rorschach in relation to success
in the college of the University of Chicago, J. educ. Psychol., 1950, 41, 65-82.

4. Eichler, R. M., A comparison of the Rorschach and Behn-Rorschach
inkblot tests, J. consult. Psychol., 1951, 15, 185-189.

5. Eichler, R. M., Experimental stress and alleged Rorschach indices of
anxiety, J. abnorm. soc. Psychol., 1951, 46, 344-355.

6. Eysenck, H. J., The scientific study of personality, New York, Macmillan,
1952, pp. 162-163.

7. Holt, R. R., and L. Luborsky, Research in the selection of psychiatrists:
a second interim report, Bull. Menninger Clin., 1952, 16, 125-135.

8. Klopfer, B., and D. M. Kelley, The Rorschach technique, Yonkers, N. Y.,
World Book, 1942.

9. Klopfer, B., Developments in the Rorschach technique, Yonkers, N. Y.,
World Book, 1954.

10. Krugman, Judith I., A clinical validation of the Rorschach with
problem children, Rorschach Res. Exch., 1942, 6, 61-70.

11. Lazarus, R. S., The influence of color on the protocol of the Rorschach
test, J. abnorm. soc. Psychol., 1949, 44, 506-516.

12. Meadows, A. W., An investigation of the Rorschach and Behn tests,
cited in Eysenck, H. J., The scientific study of personality, New York,
Macmillan, 1952.

13. Munroe, Ruth L., Prediction of the adjustment and academic
performance of college students by a modification of the Rorschach method,
Appl. Psychol. Monogr., 1945, No. 7.

14. Murray, H. A., et al., Explorations in personality, New York, Oxford
University Press, 1938.

15. Palmer, J. O., A dual approach to Rorschach validation: a
methodological study, Psychol. Monogr., 1951, No. 325.

16. Rorschach, H., Psychodiagnostics (translation by P. Lemkau and
B. Kronenburg), New York, Grune & Stratton, 1942.

17. Siegel, Miriam G., The diagnostic and prognostic validity of the
Rorschach tests in a child guidance clinic, Amer. J. Orthopsychiat., 1948, 18, 119-

18. Tompkins, S. S., The Thematic Apperception Test, New York, Grune &
Stratton, 1947.

19. Williams, M., An experimental study of intellectual control under
stress and associated Rorschach factors, J. consult. Psychol., 1947, 11, 27-29.


SUGGESTED ADDITIONAL READING 


Abt, Lawrence Edwin, and Leopold Bellak, Editors, Projective psychology,
New York, Knopf, 1950.

Anastasi, Anne, Psychological testing, New York, Macmillan, 1954, Chap- 
ter 22. 

Anderson, H. H., and Gladys L. Anderson, An introduction to projective 
techniques, New York, Prentice-Hall, 1951. 

Bell, John E., Projective techniques, New York, Longmans, Green, 1948. 

White, R. W., Interpretation of imaginative productions, in J. McV. Hunt,
Editor, Personality and the behavior disorders, New York, Ronald Press, 1944,
Vol. I, pp. 214-51.


QUESTIONS FOR DISCUSSION 


1. What is projection? Give several examples from your own experience
or your observations.

2. What basis is there for expecting a projective test to work? Why should
we be able to tell anything about a person from the types of responses that
he gives to projective test materials?

3. Why has there been such a divergence of opinion between clinical
psychologists and specialists in measurement as to the value of projective tests?

4. In what ways are the situational tests described in Chapter 12 similar
to projective techniques? In what ways do they differ?

5. Write down all the different things you can see in the four ink blots in
Fig. 15.1. If possible, get three or four other people to do the same thing.
Try to make a rough scoring in terms of the determinants given in the text.
Do you find common responses? Whole responses and detail responses?
Movement responses? How do the different records compare?

6. Is it possible or desirable for you to make a psychological interpretation
of the material obtained under question 5? What factors limit the
interpretability of this sort of material?

7. Collect several stories in response to the picture in Fig. 15.2. What
aspects of these look as if they might tell you something about the person?
What cautions would need to be observed in interpreting this kind of material?



8. Pictures such as those used in the TAT have frequently been used to 
measure attitudes. What are the advantages and disadvantages of this 
method? 

9. In what ways is the play of children a projective technique?

10. What should be the role of the teacher in regard to projective tests? 
In regard to the informal projective material in pupils’ compositions, art 
work, and other activities through which they express themselves? 


Chapter 16 


Planning a School Testing 
Program 


We want to give some standardized tests in our school. What would
you recommend?

I am on a committee to revise the testing program for our schools.
These are the tests we plan to use, and the grades in which we plan to
use them. Will you criticize this plan?


Probably every teacher of tests and measurements faces requests
like these each semester. What standardized tests should be used?
When should they be used? What constitutes a sound testing
program?


THE CART AND THE HORSE 


The trouble is that there is really no answer to such a question.
Or, rather, the only answer is another series of questions. What
do you want test results for? How are you hoping to use them?
What decisions or actions are you proposing to base on them? What
needs for information have led you to decide upon a testing
program? For a program can only be planned in terms of the purposes
it is to serve. Tests given with no particular purpose may find a
use, may create their own market, but it hardly seems likely. A
functioning testing program should grow out of the needs felt and
functions to be served in the particular school or school system and
should be directed toward meeting those needs and serving those
functions.

The first step, then, in planning a testing program for a school or
school system should be to find out what types of information about
pupils are felt to be needed by the school staff and how test results
are to be used. Before asking, "What tests should we give?" one
asks, "What information do we need that we do not now have?
When do we need it? How will we use it?" There are many
situations in which giving tests can be a rather futile enterprise. What
profit is there in a reading readiness test if all members of a class
study reading together from the same primer at the same rate and
time? Will it pay to give reading tests if there are no provisions for
differentiated individual work or remedial instruction?

The starting place is the school and its curriculum, the staff and
their needs. Of course, it cannot be expected that each single teacher
will have seen in advance how test data are to be used in forwarding
his activities with his class. Learning to use test information repre- 
sents one aspect of in-service growth. But a testing program un- 
related to local needs, local resources, and local levels of sophistica- 
tion is unlikely to function effectively. Planning that does not cen- 
ter around the ways the staff are to be brought into using the test 
information is likely to be sterile. For tests are given to be used, 
not to be filed. More important than planning what tests are to be 
given is planning how the tests are to be used. 

This is why it is that planning a testing program in the abstract
or in a vacuum is so unsatisfactory. The first question one always
needs to raise is: For what purposes will the tests be used in your
schools? Defining functions and purposes is the horse. Let us put
him out in front, and the cart carrying a program of tests will follow
after.


FUNCTIONS OF A TESTING PROGRAM 


As we use the phrase testing program in this chapter, we are
referring to an organized school-wide or system-wide program for the
administration of standardized tests. We are not thinking of the
wide variety of teacher-made tests that are prepared for use within
a single school nor of the specialized testing procedures that may be
carried on to study single pupils. Teacher-made tests have been
discussed at some length in Chapters 3 and 4 and will receive some
further consideration in a later chapter, in which we consider marking
and reporting.

Of course, many of the needs for the information that tests can
supply and many of the functions to be served by a testing program
are common from one school system to another. We present, in
Table 16.1, a brief catalogue of functions often served by tests.
This catalogue may serve as a check list to guide a review of local
needs and uses. However, the applicability of these functions must
always be tested in the local setting. We must ask: "Can we, or
do we wish to, use tests for this purpose in our schools?"

A brief discussion of the nature and appropriateness of each
function follows.


Table 16.1. Possible Functions of a Testing Program

Classroom Functions
    Grouping pupils for instruction within a class.
    Guiding the planning of activities for specific individual pupils.
    Identifying pupils who need special diagnostic study and remedial
        instruction.
    Determining reasonable achievement levels for each pupil and
        evaluating discrepancies between potentiality and achievement.
    Assigning course grades.

Guidance Functions
    Preparing evidence to guide discussions with parents about their
        children.
    Building realistic self-pictures on the part of pupils.
    Helping the pupil with immediate choices.
    Helping the pupil to set educational and vocational goals.
    Improving counselor, teacher, and parent understanding of
        problem cases.

Administrative Functions
    Forming of and assigning to classroom groups.
    Placing new students.
    Helping determine eligibility for special groups.
    Helping determine which pupils are to be promoted.
    Evaluating curricula, curricular emphases, and curricular
        experiments.
    Evaluating teachers.
    Evaluating the school as a unit.
    Improving public relations.
    Providing information for outside agencies.


CLASSROOM FUNCTIONS OF A TESTING PROGRAM 


A number of the functions of a testing program center around the 
work of the classroom teacher. These have to do with grouping for 
instruction, individualization of instruction, selection for special
diagnostic and remedial services, and the assignment of marks.

Grouping Pupils for Instruction Within a Class. One of the effective
techniques teachers have developed for dealing with individual
differences in pupils is to form within the class small groupings of
pupils who have about the same level of skill. Pupils in these
groupings may work on the same materials and at the same speed.
Standardized tests are often called upon to aid in forming these
within-class groups. They provide information quickly and objectively at
the beginning of the school year and make it possible to short-circuit
the slower and more subjective process of getting acquainted with
each pupil's abilities and skills. Readiness tests serve this function
in the first grade, and achievement tests at later levels.

Guiding the Planning of Activities for Specific Individual Pupils.
Many teachers report that they use test results to help in
individualizing instruction. This is carrying the small-group procedure still
further. Programs of work in the skill subjects are adjusted to the
present level of the individual pupil. The gifted child is encouraged
to move ahead at his own speed, and enrichment activities are pro- 
vided for him. The child of limited achievement is permitted to 
move more slowly toward more limited objectives. Both measures 
of scholastic aptitude and of educational achievement play a role in 
this type of planning. 

Identifying Pupils Who Need Special Diagnostic Study and Reme- 
dial Instruction. When a school system has resources for special 
diagnostic study of individual pupils and special teachers to provide 
remedial instruction, the testing program will usually provide
important data to help in identifying the pupils most likely to profit
from that instruction. If the classroom teacher must pick pupils
for such special services, he is likely to pick the pupils whom he
considers most below par in achievement. It is very difficult for him to
distinguish between general low ability and specialized deficiency in
a particular limited skill. A testing program that appraises both
achievement and aptitude is an aid in picking out those pupils who
have a remediable defect.

Evaluating Discrepancies Between Potentiality and Achievement.
Picking pupils for special remedial training is a special case of the
more general problem of identifying discrepancies between potential
and actual achievement. Such discrepancies may serve to focus
the efforts of the teacher upon particular pupils in his own group,
may help to orient and guide the teacher's discussions in a conference
with parents, or may influence the statements or ratings in the
periodic report to the home. This last situation is encountered in those
school systems in which the report card attempts to evaluate the
individual pupil's achievement in relation to his potentiality. Any
attempt to evaluate achievement in relation to potentiality requires
good measures of both potentiality and achievement. One function
of a testing program may be to help supply these.

Assigning Course Grades. A number of teachers, especially in the
secondary school, report using standardized test results as one
consideration in assigning course grades. This may be appropriate in
certain specific courses for which standardized tests exist whose
content parallels closely the objectives of instruction in the course.
However, the school-wide or city-wide program of standardized
testing ordinarily seems less well fitted to serve this purpose.
Standardized test content is typically general content, not directly related to
the teaching in any single school or grade. It covers broadly a whole
range of knowledges and skills. For that reason, the standardized
test is not particularly well adapted to measuring the particular
things a pupil has learned in a particular class or period of
instruction.


GUIDANCE FUNCTIONS OF A TESTING PROGRAM 
Much school testing is carried out primarily to service the school's
program of guidance and counseling. The tests are useful in
discussing a pupil's school progress with parents. The results serve to
help the pupil build up a more accurate and realistic self-picture, to
make immediate choices between alternatives offered by the school,
and to formulate appropriate long-range goals and plans. They
help school personnel to understand and plan for problem cases,
providing material for staff discussion and for case conferences.

Reporting Progress to Parents. The school has the responsibility of
keeping parents informed of the progress of their children. This
information needs to be conveyed whether the pupil is doing well or
ill. In many schools, and we hope the number is increasing, there
are regular opportunities for all parents to meet with their child's
teacher and discuss his progress. Whether the setting be one of
sweetness and light or whether the conference grows out of some
distressing problem, it helps if the teacher can document his report
with concrete, objective evidence on performance. This removes the
teacher's appraisal from the realm of individual, subjective (and in
the eyes of the parent, perhaps very biased) opinion and sets it on a
foundation of fact. We shall consider in some detail later in the
chapter the manner in which the results can best be expressed.

Building Realistic Self-Pictures. A general function of the
guidance program is to improve each individual's understanding of his
own assets and liabilities, his strengths and weaknesses. A testing
program can provide the school objective evidence on these strengths
and weaknesses to be interpreted to the individual pupil.

Helping the Pupil with Immediate Choices. The pupil at the
secondary-school level usually has a certain number of choices to make.
He must decide whether to take certain courses or determine in which
of alternative curricula he wishes to enroll. The evidence provided
by a testing program can enter into the thinking of pupil and
counselor about these choices.

Helping the Pupil to Set Educational and Vocational Goals. In
addition to providing information for immediate choices, a testing
program can contribute to the individual's long-range planning for
further education or for work. The self-picture that is built up in part
by the counselor's interpretation of test results will influence these
plans and the actions to implement them.



Improving Understanding of Problem Cases. In each school, some 
pupils present more or less acute problems of educational or social 
adjustment. They are the unruly, the withdrawn, the unhappy, the 
educationally retarded, and the other children who are not fitting 
into the educational and social pattern of the school. A testing pro- 
gram provides some basis for counselors and teachers to understand 
these cases. In particular, a systematic testing program combined 
with an adequate record system can provide some of the historical 
background for understanding the present problem. Every present 
problem has its roots in the past. Records of regular testing follow 
back at least some of these roots and throw some light on the present 
problem. 

Another aspect of the testing resources of a school system should
consist of facilities for more intensive study of children who present
special problems. Here, availability of a wide variety of testing
techniques to fit the needs of the specific case is the desideratum.
This special testing is distinct from and should supplement the
uniform program applied to all children.


ADMINISTRATIVE FUNCTIONS OF A TESTING PROGRAM 


So far we have seen ways in which a program of standardized test- 
ing may help the teacher in dealing with the children of his class or 
may contribute to the guidance activities in a school. There are 
also a number of administrative functions for which a testing program 
has at times been used. These include forming class groups,
assigning transfer students, determining eligibility for special groups,
determining who is to be promoted, evaluating curriculum, evaluating
teachers, evaluating a school as a unit, improving public relations,
and providing service to agencies outside the school.

Forming and Assigning to Classroom Groups. When a school
is large enough to have several groups at the same grade level or
several sections of a particular course, a decision must be reached on
some grounds as to who goes into which group. We have discussed
elsewhere the issue of homogeneous grouping (see pp. 236-238).
Grouping is probably more effective in single special courses, i.e.,
mathematics, English, science, etc., at the secondary or higher level.
Grouping on the basis of initial level of achievement in the subject
permits real differentiation in content covered and rate of progress.
In any event, if an administrative decision has been made to form
class groups on the basis of aptitude or of initial level of
achievement, the testing program can be organized so as to serve
those ends.



Placing Students Transferred from Other Schools. When a pupil 
transfers from one school to another, a decision must be reached as 
to the grade level and group into which he is to go. Of course, he 
may be assigned purely on the basis of age or of his grade in the 
previous school. However, if a school system ever takes into account 
the student's level of achievement in determining grade placement, 
these transfers would appear to be particularly appropriate cases in 
which to do so. The break in continuity represented by a change 
in school and in schoolmates should reduce any upset associated with 
repeating or skipping a grade. Plans for prompt testing of transferred 
students should be included in a program that proposes either to 
consider achievement level in determining grade placement or to 
adapt instruction in the class group to the particular needs of the new 
child. 

Helping to Determine Eligibility for Special Groups. The school
may provide for certain special groups or special courses for which
prerequisites have been set. On the one hand, there may be classes
for slow-learning children, and admission to these classes may be
contingent upon having an I.Q. below a specified level. On the other
hand, admission to certain subjects such as algebra may be restricted
to those falling above a certain minimum in I.Q., arithmetic
achievement, or score on a specific prognostic test. Rigid and mechanical
adherence to specified score standards seems justifiable more in terms
of administrative convenience than educational policy. However,
some basis is required for determining eligibility for any special
program. The information yielded by a testing program can provide
part of the basis for determining this eligibility.

Helping to Determine Which Pupils Are to Be Promoted. We cannot
at this point enter into a discussion of the pros and cons of having
a pupil whose educational achievement is far below that of his class
group repeat a grade. Certainly since 1925 there has been a great
reduction in the amount of repeating and a tendency to give weight
to factors other than just educational achievement when the decision
to repeat is made. However, if the issue of non-promotion is
being considered for a child, objective appraisal of achievement and
aptitude on standardized tests can serve to supplement and document
the teacher's judgment of school progress as evidence to be
considered.

Evaluating Curricula, Curricular Emphases, and Curricular Experiments.
So far we have been considering administrative evaluations
of the individual pupil and actions with respect to him. Using
standardized tests for these purposes has been under some attack,
and the value of such procedures is at least open to question.
Evaluation of the local curriculum or of curricular experiments or
innovations is a happier administrative function for a testing program.
Application of test results to the evaluation of a local curriculum or
of a curricular experiment or innovation must be made judiciously.
Standardized tests appraise at best only part of the range of curricular
outcomes. They need to be supplemented by other more informal
techniques of appraisal. Test results will need to be interpreted in
terms of the total picture of tangible and intangible objectives.
However, in any experimentally minded program it will be important
to determine how adequately the basic common skills are being
maintained.

Evaluating Teachers. Some administrators have used the results 
of a standardized testing program as one basis for evaluating the 
competence of individual teachers. The reasons for considering this 
an undesirable procedure have already been discussed rather fully 
in Chapter 11 (p. 294). 

Evaluating the School as a Unit. A testing program can serve the
school administrator as a type of educational quality control. Summary
tabulations for the separate schools and classes in a system,
reviewed in the light of the aptitude and the background character- 
istics of each group, can serve to point out strengths and weaknesses 
of particular schools and classes. The information can guide super- 
visors to places or persons needing special help or needing to make 
some shift in emphasis. If this quality control function of tests is 
applied in a punitive way, it may have the same disruptive effect 
that is likely to accompany the use of tests to judge teachers. But 
if the orientation of the central administration is toward helping 
rather than judging, the testing program can serve a valuable func- 
tion in directing that help where it is needed. 

Improving Public Relations. The schools of a community are always
fair game for critics. It is always open season for the schools.
One of the recurring themes, especially when a school deviates from
traditional patterns, is that pupils are no longer learning the basic
three R's, which is what the parents see as the objective of going to
school. A program of standardized testing provides the basis for
answering such criticisms. Since the skills with which the critics
are most likely to be concerned correspond rather closely to those
measured by standardized tests, standardized test results provide
very relevant information for answering or forestalling many critics.
If the schools are in fact maintaining the expected level in basic
skills, this fact can be demonstrated, and the administration can
move on into an exposition of the further outcomes it is trying to
achieve.

The public relations function of testing becomes particularly im- 
portant when a school is introducing some departure from the tried 
and true way of doing things. When curricular innovations are 
introduced, the administration must be prepared to meet cries that 
the fundamental skills are no longer being mastered by pupils under 
the new system. If it is a fact that they are not being mastered, 
the administration certainly needs to know it; if it is not, the
administration needs to be able to refute the charge.

The role of the classroom teacher as a front-line worker in the bat- 
tle of public relations should not be forgotten. Relevant evidence 
from studies of achievement in the schools of the community or of 
evaluations of curricular changes should be put in the hands of the 
teachers, so that they may be able to respond to criticisms and pre- 
sent to the community a true and authentic picture of achievement 
in the schools. 

Providing Information for Outside Agencies. The school is fre- 
quently called upon to supply information to other agencies—to 
social agencies, to potential employers, to other schools and colleges.
A testing program, coupled with an adequate system of cumulative 
records, permits supplying needed information in standard terms as 
it is requested. Objective information on aptitude and achievement 
can be made available as needed. 


PARTICIPATION IN PLANNING A PROGRAM 


We have seen the great variety of functions for which the results 
of a testing program may be used. We have seen how the test scores
may be useful to the classroom teacher in working with individual
pupils and groups of pupils, how the test results enter into the
guidance program, and how test results serve functions relating to 
supervision and administration. This outline indicates functions 
that tests may serve. Whether they will serve these functions effec- 
tively depends upon whether the testing program is planned with 
these objectives in view and whether it is understood and supported 
by the potential users. For these reasons, we believe that within a
school system there should be widespread participation in the original
planning and periodic review of a testing program.

Since the main categories of potential users are (1) administrative
and supervisory personnel, (2) guidance personnel, and (3) classroom
teachers, we believe that representatives of these three groups should
participate in the planning. The needs of each group should be
considered in planning the tests to be used, the time of their
administration, and the manner in which the test results are handled.
Participation should not cease when a specific testing program has been
introduced but should continue in the form of periodic review of the
adequacy of the tests and the ways of handling them and in in-service
development of better ways of using the test results. One may
anticipate that when teachers participate in planning and reviewing
the testing program there will be the greatest likelihood that the
test scores will function in important ways in classroom activities
and the least chance that test papers will be filed away or scores
entered on record cards and forgotten.


QUALITIES DESIRED IN A TESTING PROGRAM 


What are the general characteristics of a good school testing pro- 
gram? We shall consider three briefly: relationship to use, integra- 
tion, and continuity. 


RELATION TO USE 


We have outlined in the previous section a number of functions
for which standardized tests are used in some school systems. The
first step in planning a testing program for your schools is to review
these and possibly other functions and determine for which of them
tests are needed and will be used in your schools. The testing should
then be planned in relation to these uses. Tests should be selected
and the times at which they are given should be chosen so that the
needed information will be available and as up to date as possible
at whatever time it is needed.

Thus, suppose teachers in the first grade wish to use test results
to help them form in their classes subgroups that will move into
reading at different rates. Our need is then for a reading readiness
test. If almost all pupils go to kindergarten, the test might be
scheduled for the end of the kindergarten year, say, in May. But
more likely we will want to give the test in the first grade early in
the fall, say, about the beginning of October, as soon as the pupils
have settled down in their new class.

Or again, suppose that a differentiated high-school program is
available for the three years of senior high school. Counselors and
pupils work out plans for the high-school program during the spring
of the ninth grade in the light of available information on pupil
aptitudes and achievement. In this setting, a program of aptitude
and achievement testing during the first semester of the ninth grade
will provide relevant and current information. Here, as everywhere,
the important thing is to provide what will be used when it will be
used.


INTEGRATION 

The testing program should be seen as a whole. Information that 
is needed in the sixth grade is not unrelated to the information that 
was obtained in the fifth grade or to information that will be useful 
in the seventh grade. Each item of information gathered should be 
obtained at such a time and in such a way that it will make the max- 
imum contribution to the total. Several aspects of integration merit 
consideration. 

In an integrated program, it will usually be desirable to use the
same series of tests over the grade range for which the series is
appropriate. Thus, if the Metropolitan Achievement Tests are being
used to measure progress in basic skills, it will probably be desirable
to use them in any grade from first up to sixth, and possibly to eighth
or ninth, in which an achievement battery is being used. The
advantages are that norms are based upon the same sampling of
communities from grade to grade, and the tests conform to a common
outline of content and format. Thus, scores from one grade to the
next are more nearly comparable, so that a truer picture may be
obtained of pupil growth.

Integration implies particularly integration between the several
divisions of the school program. Tests in junior high school should
be planned in relation to those already given in the elementary
school, and those in senior high should take account of junior high
testing. An intelligence test given in the sixth grade need not be
followed by a similar test at the beginning of the seventh grade.
If aptitude tests are given in the ninth grade, there is limited gain
from a similar battery in the tenth.

Integration between divisions of the school implies continuity of
school records. The records accumulated about a pupil in the
elementary school should follow him, in whole or in part, when he goes
on to the secondary school. It may be that the complete record,
especially if it is a very full one, should not stay in the active record
file. However, key information should carry over into the record
system of the higher school, and the full record should be available
for reference if need be.

Integration means, finally, timing testing so that in so far as
possible multiple purposes can be served. An attempt to serve two
masters is always a compromise. However, it sometimes represents
a sound use of limited resources. Thus, an intelligence test fairly
early in the sixth grade can serve adequately for sectioning and
guidance in the seventh grade and still be available as a resource
for studying problem cases and issues of promotability in grade 6.
A scholastic aptitude test in grade 10 or 11 gives as good a prediction
of college success as one taken in May of the senior year and is
also available for counseling purposes during two or three years of
high school.


CONTINUITY 


The potential values of a testing program increase as it is con- 
tinued over a period of years. Advantages of continuity in the pro- 
gram are two-fold. On the one hand, data accumulate in the records 
of the individual pupil. There are available to contribute to an under- 
standing of the youngster not merely the results of tests given during 
the present year but also the data from earlier testings one or more
years ago in lower grades. Present status can be seen in perspective
in relation to earlier records. Thus, we can see whether Jerry's
academic problems in the sixth grade have developed recently or whether
they represent merely the continuation of an early trend; we can
determine whether the difficulty Mary is having with long division
is new or whether it has its roots in early difficulties with arithmetic
fundamentals.

Continuity also is of importance in permitting a school system to
get to know the particular tests it is using. We have emphasized at
various points that a good deal of caution must be exercised in
applying national norms to a local setting. Continued local use of a
test or test series permits the development of local standards of
expectancy. This may take the form of an informal and implicit
tempering of national norms in interpreting local performance. It may
take the form of an actual set of local norms. Thus, a suburban
community with a very high percentage of pupils going on to college
may find that percentiles based on their own school population
provide a more appropriate framework for judging the academic status
of one of their pupils than do national age or grade norms.

Getting to know tests implies getting to know what they measure,
as well as establishing local standards of expectancy. If teachers
work with the tests and test results, they will come to know what
the test covers, what cues for diagnosing group strengths and
weaknesses can be drawn from it, and what its limitations are. Both
types of familiarity are desirable when teachers are using test results
to help them with their work. For this reason, a school system will
ordinarily wish to continue to use the same tests over a period of
years, changing them only when they become out of date or when a
study of other available tests indicates that there are ones available
that represent a definite improvement over those that have been
used in past years.
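
The contrast drawn above between national norms and a set of local
norms can be made concrete with a small calculation. What follows is
only a minimal sketch under stated assumptions: the function name
percentile_rank, the convention of counting half of the tied scores,
and both lists of raw scores are hypothetical illustrations, not data
from any published test.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, reference_scores):
    """Percent of the reference group scoring below `score`.

    Half of the scores tied with `score` are counted as below it,
    which is one common convention (an assumption of this sketch).
    """
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    below = bisect_left(ordered, score)          # scores strictly lower
    ties = bisect_right(ordered, score) - below  # scores equal to `score`
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw scores, invented purely for illustration.
national_sample = [38, 42, 45, 47, 50, 52, 55, 58, 61, 65]
local_sample = [50, 54, 57, 60, 62, 65, 67, 70, 72, 75]

pupil_score = 60
national_pr = percentile_rank(pupil_score, national_sample)
local_pr = percentile_rank(pupil_score, local_sample)
print(f"national percentile rank: {national_pr:.0f}")
print(f"local percentile rank:    {local_pr:.0f}")
```

With these invented figures the same raw score stands well up in the
national group but below the middle of the local group, which is the
kind of difference a high-achieving community would want its local
norms to show.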


SUGGESTED PRIORITIES IN A TESTING PROGRAM 


At the beginning of the chapter we indicated an unwillingness to 
prescribe a particular pattern for a testing program. This unwillingness
stemmed in part from the different functions which the program may
serve in different schools. It stemmed also from the wide
variation in financial and professional resources available in different
school systems. However, there is a general core of uniformity, in
spite of diversity, and the same types of tests are available to all
schools. That being so, we shall offer some suggestions on the tests
we consider likely to prove most generally useful in a program at
the different levels.


A PROGRAM FOR THE ELEMENTARY SCHOOL 

In the elementary school, our concern centers in helping the in- 
dividual to master the tools of learning and communicating while 
he is learning to live and work in a group of his fellows. At the same 
time, the individual is building up a background of experience,
knowledge, and understanding at his own level. It is in this setting
that tests must function. Though it is difficult to arrange choices
in an ordered sequence, since tests relate to each other and function
in teams, we have attempted to do so roughly.

In all subsequent discussion it is assumed that children have been
adequately tested for vision and hearing. We have thought of these
measures as part of the physical examination rather than as part of
the school testing program. They are, of course, of fundamental
importance in guaranteeing a profitable educational experience for
each child. There is nothing more tragic than the so-called "stupid"
child in the back row, who got nothing out of school because he could
not hear what the teacher was saying or see what was written on
the blackboard.

Reading Tests. We are disposed to give first place in our program
of elementary-school testing to tests of reading ability. Reading
has always been the key avenue for acquiring all types of organized
knowledge. Though in the present-day world its supremacy is
challenged somewhat by movie, radio, and television, learning from books
will continue to be at the heart of education, especially at the higher 
levels. For this reason, aiding the school in making early identifica- 
tion of poor progress in reading and in keeping track of reading prog- 
ress through the school years seems to be among the most useful 
services a testing program can perform. 

If our very meager resources permitted only a single reading test, 
we suspect that we would give it at the end of the second or be- 
ginning of the third grade, so that it might be available to identify 
individuals for special help for a year or so before they reached the
greater variety of content and the more extensive demands upon 
independent reading during the later elementary grades. However, 
we would like to be able to give a reading test every year or two 
from the beginning of the second grade throughout the elementary 
school. 


Group Intelligence Test. To aid in interpreting reading test results
and other aspects of academic achievement, to help in setting
standards for pupils, and to aid in understanding problem cases, we would
like to have results from intelligence testing. When we are thinking
in terms of a minimum program and are considering the practical
realities of time and cost, we must settle for a group test. If it is to
be used in conjunction with our early third-grade reading test, it
must be a non-reading test of intelligence. But group tests given as
early as the beginning of the third grade do not have too satisfactory
reliability or stability over time. If we can afford only a single group 
intelligence test, we would probably do well to delay it until the 
fourth or fifth grade, when the results will be more dependable. If 
a test is given in the second or third grade, we would like to be able 
to include at least one more group intelligence test during the upper 
elementary grades, fourth, fifth or sixth, the choice of grade depend- 
ing upon plans for testing in junior high school. We would not ob- 
ject to additional tests, given fairly adequate resources. These would 
serve primarily to increase the reliability of our appraisal. We would 
be glad to have both verbal and non-verbal measures of intellectual 
ability. 
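
The remark that additional tests serve primarily to increase the
reliability of the appraisal can be illustrated with the Spearman-Brown
formula. This is only a rough sketch, assuming that the several
testings behave as parallel measures whose scores are averaged; the
reliability figure of .80 for a single test is a hypothetical value
chosen for illustration, not one reported for any particular test.

```python
def spearman_brown(r_single, k):
    """Projected reliability of the average of k parallel testings.

    r_single is the reliability of one testing. Treating separate
    group intelligence tests as parallel forms is an assumption of
    this sketch, not a property guaranteed for real tests.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= r_single <= 1.0:
        raise ValueError("r_single must lie between 0 and 1")
    return k * r_single / (1.0 + (k - 1) * r_single)


# Hypothetical reliability of one group test.
single_test_reliability = 0.80
for number_of_tests in (1, 2, 3):
    projected = spearman_brown(single_test_reliability, number_of_tests)
    print(f"{number_of_tests} test(s): projected reliability {projected:.2f}")
```

Under these assumptions a second testing raises the projected
reliability appreciably, and a third adds a smaller further gain, which
agrees with treating extra tests as desirable but not of first priority.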

Basic Skills Battery. Competing with the intelligence test for
second place in our program would be a battery covering the basic
skill subjects. This would, of course, include reading and could
replace the separate reading test. If the battery could be given only
once, we would probably choose to give it in grade 3 or 4, where the
results could be used by the school for planning individual programs
of instruction for pupils or for fitting pupils into special programs of
remedial instruction. However, we would like to be able to test
children with such a battery every year, starting in the third grade
or possibly the second. We have a certain bias in favor of carrying 
out this testing in the fall, so that the fresh results may be available 
for use by the teacher who has worked with the testing. 

Reading Readiness Test. On the assumption that the teachers in 
our school have an individualized program of instruction in reading 
for the first grade, we would place a reading readiness test next upon 
our list. We might possibly move it higher. We would view this 
test as a partial guide to the first-grade teacher in organizing
subgroups for reading instruction and as a basis for helping her to
evaluate the progress of individual pupils.

The four types of tests we have listed so far are the ones commonly
found in school testing programs. Other types of tests will be found 
much less frequently. Some are not used because of cost; some, 
because what they have to offer seems less important. 

Individual Intelligence Tests. Individual intelligence tests have 
certain advantages over group tests, particularly in the early grades.
If a community has adequate resources and trained personnel, it
would be well worth while to have an individual test administered to
each child toward the end of kindergarten or in the first grade. An
individual test later in the elementary school could be used to replace
one of the group tests.

Achievement Tests in Content Subjects. Some of the achievement 
tests for the upper elementary grades include sections dealing with 
content areas of science, literature, and social studies. Measures in 
these areas may have some value for administrative appraisals of
the school program, though they are probably too general in content 
to help the classroom teacher greatly in her work with individual 
pupils. Other special achievement tests may appeal to certain 
schools, but we do not recommend them as general features of an 
elementary-school program.

Other Types of Measures. Other types of tests may be called for 
in the case of individual children. These include the diagnostic tests 
used in the special study of children with disability in a particular 
subject. They include also the techniques of clinical testing, pro- 
jective and otherwise, used by the clinical psychologist in studying 
a problem case. However, these represent supplements to the test- 
ing program, rather than a basic part of it. 

We have not recommended paper-and-pencil personality questionnaires
because of serious doubts as to (1) the validity of the information
they provide in the typical school situation and (2) the soundness
of interpretations that school personnel can make from the results.
This does not mean that personality development of the
elementary-school child is of no concern to the school. Rather, it
means that understanding the child as a person must depend upon
the direct and more informal observations of each pupil by the teacher
and other school personnel.


THE SECONDARY-SCHOOL TESTING PROGRAM 


In the secondary school, the pupil has reached a point where usu- 
ally a number of educational choices and decisions must be faced. 
He may have to make choices of particular subjects or between 
different curricula. At the same time, he must start thinking about 
future plans: how long to stay in school, whether to go on to college, 
what to plan for in the world of work. The curriculum has, to a 
considerable extent, moved beyond the basic skills and deals with 
various types of special content. However, reading, writing, and, to
a somewhat lesser extent, arithmetic are still called for in the service
of these later content learnings. These shifts in emphasis appear to
call for a corresponding shift in the pattern of the school testing
program. The order of priority we propose is outlined below.

Scholastic Aptitude Test. A major decision for each individual is
how far up the educational ladder he shall seek to go. A related
decision, where several distinct high-school curricula are available,
is whether he shall continue with a college preparatory program or
enter a commercial, industrial, or general program. A scholastic
aptitude test is of value, as a supplement to school grades, in arriving
at these decisions. It is, of course, also of value in interpreting
the pupil's school progress and setting the standard of expectancy for
him.

By high-school age, the abilities measured by a scholastic aptitude
test have become pretty well stabilized. There is little systematic
shift from one year to the next. Thus, a tenth-grade test will serve
to estimate scholastic aptitude at the end of the twelfth grade as
well as one given later. For this reason, if only a single test is to be
given, it may be given early, so that it may be used throughout
the school program. If more than one testing can be provided for,
the tests may be spread fairly evenly through the high-school
years.


Reading Test. In secondary and higher education, the role of reading
becomes, if possible, even more critical. There is relatively
little to be gained by giving a reading test unless the school has some
provisions for taking action on the basis of the results and providing
guidance in remedial activities. Where such resources are available,
however, suitable reading tests to identify candidates for the reading 
program can play an important role in the school program. Since 
the constructive use of this testing implies remedial action, it is 
desirable that the testing be early in the program of the particular 
school—in the seventh grade for the junior high school, the tenth 
grade for the senior high. 

Tests of Special Aptitudes. With an eye to vocational counseling, 
especially for those who will not go on to college, a high-school 
testing program can well include measures of aptitudes important 
for job success. These may be separate tests of mechanical, clerical, 
and spatial aptitudes. They may be incorporated into a single bat- 
tery such as the Differential Aptitude Test Battery. If such a com- 
plete battery is used, it can take the place of the scholastic aptitude 
test, since certain of the subtests can be combined to give a suitable 
estimate of scholastic aptitude. 

Tests of specialized aptitudes are usually not well adapted to use 
below the eighth or ninth grade. The particular time at which they
are used will depend upon the total program of guidance in the school.
Testing should be carried out so that results will be available and as
up to date as possible in the grade in which counseling is focused and
in which major decisions must be made. This may well be the ninth
grade in a junior-senior high-school system. Additional data from
further testing in the eleventh grade would make possible more reliable
estimates of aptitude, to be used in counseling at the end of
the school program.

Interests. Interests join with aptitudes in providing the background
for vocational planning. For this reason, an interest inventory
seems an appropriate supplement to aptitude measures, and
evidence on interest could appropriately be gathered at the same
time that the evidence on aptitudes is assembled. At the secondary
level, an instrument that evaluates in terms of general interest areas,
such as the Kuder Preference Record, seems preferable to one that
provides ratings for specific jobs.

Achievement Tests in Content Areas. We have placed standardized
achievement tests in the content areas somewhat lower down on the
list for testing in the secondary school, because we believe that they
have somewhat less to contribute to the guidance of pupils than the
types of tests we have already considered. However, evaluation of
achievement by standardized tests in the general areas of secondary
education, science, social studies, and the arts and letters, does have
value in appraising potential for higher education. A set of such
tests would have value in the ninth grade if choices are to be made
with respect to the particular program in the senior high school.
A survey achievement battery might be included again in the eleventh
or twelfth grade, in connection with the final decision as to collegiate 
education. 

Prognostic Tests. There are several subjects for which special
prognostic tests appear to have some merit. These include algebra,
foreign languages, and shorthand. A school may perhaps wish to give
such tests the year before the particular program of instruction starts
and use them as one factor in screening students.

Personality and Adjustment Inventories. We have serious reservations
about the practical value of paper-and-pencil techniques for
assessing personality at this level also. However, they are probably
more usable in the secondary school than in the elementary grades.
Their use might prove of some value if counseling personnel with
good background are available and the inventories are used merely
as rough screening devices to pick up pupils for more thorough
study.

A TESTING PROGRAM FOR THE COLLEGE 


At the college level, students' programs become still more diversified
and specialized. Basic skills are established, and little systematic
attention is paid to them, except perhaps in one or two freshman
courses. The academic pace is generally stiffer, and maintaining
that pace becomes more of a problem. Students are more on their
own, and for some individuals crises arise in adjusting to a more
independent existence. The problems of vocational choice become
more immediate and pressing.

At each educational level, a balanced program of measurement
services should provide for giving special tests as needed by individual
cases. The counseling center, in which guidance is available and
testing may be carried out as seems indicated for each client, should
play a major role in the measurement program of a college. At this
level problems are more individual, and there is less place for a uniform
program in which the same information is gathered for everybody.
However, certain types of information of sufficiently general
interest may with advantage be gathered for all students.

Scholastic Aptitude. Since a substantial number of college students
have difficulty in maintaining satisfactory scholastic standing, evidence
on scholastic aptitude is often needed as a basis for counseling
with respect to academic difficulties. Separate appraisal of verbal
and quantitative abilities may be of value in permitting a more
diagnostic appraisal of abilities. The test should be given at the time of
admission so that the results may be available for use throughout the
college course.

A battery of special aptitude tests could be used in place of the 
scholastic aptitude test. However, there appears to be somewhat 
less of a need for special aptitude tests in a college population. The 
decision to go on to college has already somewhat narrowed down 
the range of occupations for which the group is preparing, and tests 
of mechanical, spatial, and clerical abilities are of rather less signifi- 
cance for college students than for a high-school group. 

Reading. A measure of reading ability contributes some further 
basis for understanding problems of academic failure. It has a special 
function if the college provides a reading clinic in which remedial 
instruction may be obtained. A reading test given at the time of
college entrance and interpreted in conjunction with the scholastic
aptitude test provides one basis for locating students who might
advantageously receive such special help.

Interests. With final vocational choices drawing close, early in
college appears to be a time at which a vocational interest test can
advantageously be given to all students. Thus, if a test of interest
in specific vocations, such as the Strong Vocational Interest Blank,
is given during the sophomore year, the results can be available for
consideration at the time that choices of major field are made. Since
interest patterns at that age remain quite stable, the scores will be
suitable for special counseling throughout the remainder of the
college course.

Placement Tests in Special Subjects. Some colleges undertake to
section certain introductory courses, such as the freshman English
course or the freshman mathematics course, on the basis of ability
of the entering students. Such sectioning permits different content
and instruction for the extreme groups. Where this is done,
standardized tests in the subject area provide one way of
accomplishing the grouping quickly and efficiently.

Adjustment Inventory. The typical adjustment inventory is better
suited to college students than it is to less mature and less
educated groups. Where extensive counseling services exist, such an
inventory, or a group administration of projective tests, might be
used to screen out individuals for further study. However, we have
serious reservations as to the value of such a procedure. Ordinarily
at the college level counseling is initiated at the request of the
client. We question how effective or useful college-wide screening
and bringing in students for conferences is likely to be. Use of the
scores by departmental advisers and others without special training
is hardly to be recommended.


SUMMARY OF SUGGESTED PROGRAMS 

Let us emphasize again that any testing program needs to be
formulated by personnel in the local situation aware of local
conditions, local purposes, and local resources. The proposals that
have been outlined in this section are at most rough general guides.
The highlights of this discussion have been organized in tabular form
in Table 16.2. The most highly recommended tests are marked with a
double asterisk. Tests considered useful supplements in an extensive
testing program or of value for certain special purposes receive a
single asterisk. Procedures deemed of doubtful value are indicated by
a question mark. Where no mark is made, it indicates that the type of
test is considered inappropriate or of little value at that level for
a class-wide or school-wide testing program.


Table 16.2. Suggested Tests for School Testing Program

                                          Educational Level
Type of Test                      Elementary  Secondary  College

General intelligence or scholastic aptitude BEN TN M
Reading ** ++ 2:
Basic academic skills ы
Reading readiness *
Individual intelligence *
Achievement in content subjects * ж Ы
Personality inventory ? d
Interest inventory ** sii
Special aptitude tests or battery "A n
Prognostic tests for special subjects ш


CUMULATIVE RECORDS 

A school testing program implies as an almost automatic corollary
a cumulative record system. If the records of testing are to be most
useful over the years, the information for any pupil must be
conveniently available in organized form. Of course, cumulative
records do not relate only to test results. They refer to all the
types of information about the pupil that are a matter of lasting
interest to the school in its guidance activities with the pupil and
in its later responsibility for supplying information to colleges or
employers.




Probably every school keeps some records that accumulate during
the years of the pupil's stay in school. But the existing system may
not function as effectively as would be desired to provide information
useful in the guidance of the pupil at all stages in his school career.

Just as a testing program is a local matter, depending upon local
needs and resources, so the planning of a record system is a local
matter. The types of records to be kept, the form and place in which
they will be kept, and the persons by whom they will be kept must
be decided in terms of the purposes and orientation of the local
school, the types of information gathered about pupils, the ways in
which information is used, and the resources of guidance and clerical
personnel. Planning or reviewing the local record system can well be
a cooperative enterprise involving all types of school personnel. The
important thing about records is not what is put into them but what
is gotten out of them. Records are to be used. We need to know who
uses them, what information is desired and for what purposes, and
other down-to-earth facts before we can reach sound decisions on
content, format, location, and servicing. Here again, one anticipates
that use of the information will increase if the potential users
participate in determining what is needed in a record system.


CONSIDERATIONS IN PLANNING A CUMULATIVE RECORD SYSTEM

The chief considerations in planning a cumulative record system
are (1) what is to go in the records, (2) in what form the records are
to be maintained, (3) where the records are to be kept, and (4) what
the functions of different persons are to be in preparing and using
the information. We shall consider each of these issues briefly.

Content of Cumulative Records. There are many types of information
that clamor for inclusion in a systematic set of records. Among
others, we may note

1. Basic attendance and pupil accounting information.

2. Teachers' appraisals of academic achievement, and grades as
reported to pupils or parents.

3. Records of all standardized tests.

4. Anecdotal records and other informal appraisals of the pupil.

5. Physical examination and health data.

6. Records of participation in sports, clubs, and extracurricular
and work activities.

7. Information about home and family.

8. Information about pupil interests, plans, and objectives.




As a general principle, we may suggest that any information worth
the time and trouble to gather will be worth making a matter of
record.

The pattern of a record system will be in large measure a reflection 
of the program for studying individual children in the school. How- 
ever, we must take care that the system does not become too bur- 
densome for available facilities. Record keeping is not an end in 
itself. Making the records should not overshadow using them. 

Format of Cumulative Records. When one plans the forms to be 
used in setting up a cumulative record system, one must compromise 
between a number of distinct and, to some extent, conflicting ob- 
jectives. On the one hand, one wishes the record to include as much 
information as possible and to include it in a form that is easily 
read. One wishes space for many items and ample room for each. 
On the other hand, one wishes a record that is compact and easy to 
file. One wishes a system of records that can be used by different 
people in different places, but one wants the information centralized. 
One wants durability and one wants economy. One wants the cleri- 
cal chores of maintaining the records kept down to a minimum. 

Any actual record system is a compromise of these values. In 
the interest of compactness and ease of handling, the basic informa- 
tion is most often assembled on a large card or cards or printed on a 
file folder. Forms have been adapted for tray-type visible filing sys- 
tems, and many schools have found these convenient. Often the 
card file is supplemented with a separate folder or envelope for each 
pupil, which can hold various types of more bulky material, and this 
is separately filed. 

If the material on a record form is to be used readily, it is
important that it be systematically and logically organized. A
sequential organization by years across the card from left to right
permits following the course of pupil development. Related types of
material should be presented together in the record so that they may
readily be seen together.

The part of the form in which standardized test data are recorded
should provide space enough so that necessary identifying information
can be included together with the test results. The following
information should generally be recorded: (1) name of test, (2)
specific form of test, (3) date of testing, (4) raw score, (5) type of
converted score reported (I.Q., age score, grade score, percentile for
particular grade group, etc.), and (6) converted score.

Location of Cumulative Records. In planning for the location and
housing of cumulative records, we once again face a compromise of
conflicting objectives. Records exist primarily to be used. The
records should be placed where all who have legitimate occasion to use
them can have convenient access to them. At the same time, records
must be protected from examination by unauthorized persons. The
problem of confidential information is often handled by providing a
separate confidential file, to which cross-reference is made. The
other records can then best be placed in some central location where
(1) they are available for use at any time during school hours, (2)
there is space to work with them, and (3) they are under the eye of
some authorized person who can take general responsibility for them.
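
For a school that keeps its records in machine-readable form, the six items of identifying information listed above for each standardized test entry might be sketched roughly as follows. This is only an illustration in Python: the class name, the field names, and the sample entry are hypothetical and are not taken from any actual record form.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StandardizedTestEntry:
    """One test entry on a cumulative record (hypothetical field names)."""
    test_name: str             # (1) name of test
    test_form: str             # (2) specific form of test
    date_tested: date          # (3) date of testing
    raw_score: float           # (4) raw score
    converted_score_type: str  # (5) type of converted score, e.g. "grade score"
    converted_score: float     # (6) converted score

    def summary_line(self) -> str:
        # A compact one-line rendering, suitable for a chronological listing.
        return (
            f"{self.date_tested.isoformat()}  {self.test_name} "
            f"(Form {self.test_form}): raw {self.raw_score:g}, "
            f"{self.converted_score_type} {self.converted_score:g}"
        )


# Illustrative entry only; the scores and date are invented.
entry = StandardizedTestEntry(
    test_name="Paragraph reading test",
    test_form="A",
    date_tested=date(1954, 4, 15),
    raw_score=42,
    converted_score_type="grade score",
    converted_score=5.6,
)
print(entry.summary_line())
```

Keeping the type of converted score next to the score itself is the point of item (5): a bare "5.6" cannot be interpreted later unless the record says whether it is a grade score, an age score, or something else.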

Functions of Different Staff Members. A cumulative record sys- 
tem should be a cooperative enterprise enlisting the support of the 
whole school community. Pupils should sense that the records are 
planned for their benefit. In a secondary school, at least, they may 
well be told about the record system as a part of the homeroom or 
guidance program. Part of the information will be supplied by the 
pupil: information about the home, out-of-school activities, interests, 
plans, and objectives. The classroom teacher should be both a 
contributor to and a user of pupil records. He contributes anecdotal 
material and evaluations of pupil growth, as well as grades in sub- 
ject matter. He uses the information already in the record in getting 
a better picture of individual pupils. Testing and guidance person- 
nel contribute data from school-wide or individual testing and use 
the results as background for conferences with pupils. 

A practical problem is that of handling the actual clerical detail 
of making entries and keeping the records up to date. Where special 
clerical help is available, the routine entering of large bodies of ma- 
terial—scores from a school-wide program of testing, periodic grades, 
attendance records, etc.—can well be turned over to these special 
individuals. However, this is not a complete gain. There is some- 
thing to be said for having the user, the classroom teacher in the 
elementary school or the homeroom teacher in the secondary school, 
work closely with the records. If the clerical burden does not become
too heavy, the teacher may gain in understanding of pupils from
working more continuously and more intimately with the records.


ILLUSTRATIVE RECORD FORMS 

There are many local variations of record systems adapted to local
needs. Many of these represent adaptations of forms developed by
committees of national organizations concerned with the problem of
records. We reproduce in Fig. 16.1 a form developed by the Educational
Records Bureau for use by its subscriber schools. A number of other
forms will be found illustrated in Traxler's Techniques of Guidance.
Illustrations such as these provide a basis for local planning but
should be adapted in terms of local needs and resources.

Fig. 16.1. Sample cumulative record form. [Form not reproduced:
Cumulative Record for Independent Schools, Educational Records Bureau,
437 West 59th Street, New York, N. Y. The first side provides spaces
for identifying and family data and, by year and age, for adviser,
attendance, discipline, home influences and cooperation, mental and
emotional, physical and athletic, extracurricular activities and
interests, notable accomplishments and experiences, educational plans,
personality ratings, and remarks. The second side provides spaces for
year, school, grade, mental age, chronological age, subjects, academic
aptitude tests, and test scores.]


PRESENTING THE RESULTS OF TESTING TO 
THE PUBLIC 


When we speak of presenting test results to the public we may be
thinking of reporting on the performance of the school system as a
whole or of a major segment of it. On the other hand, we may be
thinking of presenting the test results for a particular individual to
a specific interested public, the parents or, possibly, the pupil
himself. Each type of public use of test results has its place, and
each presents its problems. We shall give a little attention to each
in turn.


REPORTING THE RESULTS OF A PROGRAM OF TESTING 


Test results are often used as one basis for an over-all appraisal
of a school and its curriculum. The appraisal may be a within-school
one, and the report may be planned to stimulate self-evaluation and
self-criticism. Or the report may be to lay groups, the board of
education, groups of parents, or the general public that supports the
schools.


Whatever the audience, the purpose of the report will be to
summarize, organize, and interpret the test results so that a
meaningful picture of the school's accomplishments will emerge. Scores
will be tallied for each class and school, separately for each
significant subtest, and for the test as a whole. The scores of
interest will ordinarily be converted scores, i.e., age or grade
scores or percentiles for a particular grade group. Measures of
average score (mean or median) will be obtained for each group of
interest. These will then be organized for the audience in tabular or
graphic form.
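
The tallying step described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name, the class labels, and the grade-equivalent scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_group(records):
    """Tally converted scores by group and report n, mean, and median.

    `records` is an iterable of (group_label, converted_score) pairs.
    """
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)
    return {
        group: {
            "n": len(scores),
            "mean": round(mean(scores), 2),
            "median": median(scores),
        }
        for group, scores in sorted(by_group.items())
    }


# Invented grade-equivalent scores for two fourth-grade classes.
scores = [
    ("Class 4A", 4.9), ("Class 4A", 5.3), ("Class 4A", 5.6),
    ("Class 4B", 4.2), ("Class 4B", 5.0),
]
summary = summarize_by_group(scores)
for group, stats in summary.items():
    print(group, stats)
```

The same function could be run once per subtest and once for the total score, giving the separate tallies the text calls for.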

In presenting test results to an audience, we will be interested in
various types of comparisons. Comparisons that may be of significance
include:

1. Comparison of local group performance with national norms or
more specialized norms if they are available.

2. Comparison of local group performance with level of ability in
the group.

3. Comparison of achievement in different subject areas.

4. Comparison of different schools in a system or possibly different
class groups in a school.

5. Comparison of groups taught in different ways, using different
methods or materials.

6. Comparison of groups at different grade levels.

7. Comparison of the same group at different times to show pupil
growth.


If an effective presentation is to be achieved, graphic representa- 
tions will be needed as a supplement to the relatively formal and 
forbidding tabular presentation of evidence. Presentation of group 
data falls most naturally into the pattern of the bar chart or of the 
profile (see Chapter 7, pp. 172-182). 

Suppose that we had tested all the pupils in grades 2 through 6
of an elementary school in April. The results were as shown in Table
16.3 below.

Table 16.3. Median Performance of Each Grade of School W on
Intelligence Test and Achievement Battery

(Tested in eighth month of school year)

                                        Grade Medians
Test                               2     3     4     5     6

Intelligence test                 3.3   4.4   5.4   6.3   7.6
Word knowledge test               3.7   4.6   5.9   6.6   7.8
Paragraph reading test            3.6   4.8   5.6   6.5   8.1
Arithmetic fundamentals test      2.2   3.6   5.1   6.2   735
Arithmetic reasoning test         2.5   3.7   5.0   6.4   hist
Language usage test               3.2   4.3   5.4   6.2   7.8
Spelling test                     2.8   3.0   4.9   5.9   6.8


This table shows a school in which the pupils are generally of
above-average intelligence. (For example, when the tests were given
in grade 2.8, the median intelligence grade level was 3.3.) Reading
achievement is generally superior, above both national norms for
the grade level and the intelligence level of the group. Arithmetic
is below national norms in the early grades but is above in the later
grades, approximately equaling the intelligence level. The reader
can note other details of the table.

Figures 16.2 and 16.3 show two types of graphic representation
of these data. The first figure shows a profile of separate test
medians for the second-grade group. The upper profile in the figure
shows performance in relation to national norms for the grade level.
The lower profile shows performance in relation to the ability level
for the group. Similar profiles would, of course, be prepared for
other grades.
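
The two comparisons drawn from Table 16.3, against the national norm for the grade and against the intelligence level of the group, amount to simple subtraction. The short Python sketch below works them out for a few rows of the table. It assumes that the norm for a grade tested in the eighth month is the grade number plus 0.8 (as in the 2.8 example in the text), it uses only rows whose entries are all legible, and its names are illustrative.

```python
GRADES = [2, 3, 4, 5, 6]
MONTH_OF_TESTING = 0.8  # assumption: eighth month, so grade 2 norm is 2.8

# Grade medians copied from Table 16.3 (fully legible rows only).
medians = {
    "Intelligence test":      [3.3, 4.4, 5.4, 6.3, 7.6],
    "Word knowledge test":    [3.7, 4.6, 5.9, 6.6, 7.8],
    "Paragraph reading test": [3.6, 4.8, 5.6, 6.5, 8.1],
    "Language usage test":    [3.2, 4.3, 5.4, 6.2, 7.8],
}


def deviations(test):
    """Return (grade, median minus grade norm, median minus intelligence level)."""
    rows = []
    pairs = zip(GRADES, medians[test], medians["Intelligence test"])
    for grade, score, intelligence in pairs:
        norm = grade + MONTH_OF_TESTING
        rows.append(
            (grade, round(score - norm, 1), round(score - intelligence, 1))
        )
    return rows


for grade, vs_norm, vs_intelligence in deviations("Paragraph reading test"):
    print(f"Grade {grade}: {vs_norm:+.1f} vs. norm, "
          f"{vs_intelligence:+.1f} vs. intelligence level")
```

For paragraph reading every difference is positive in every grade, which is the sense in which the text calls reading achievement superior to both reference standards.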


Figure 16.3 is a bar chart comparing achievement in arithmetic
fundamentals in the different grades. In order to include a reference
standard for each grade, the national norm for the grade and the
grade level of the group on intelligence are both indicated. This
type of bar chart could, of course, be used for comparing different
schools, groups taught in different ways, or other different types of
groups.

Fig. 16.2. Profile of achievement test medians for grade two, school W.
[Figure not reproduced. Panel A: in relation to grade level (grade
level 2.8). Panel B: in relation to intelligence level (intelligence
level 3.3). Vertical axis: grade equivalent.]


An effective popular presentation will contain much graphic and
pictorial material but few tables, hiding these in appendices or other
odd corners where they will not distract the audience.

It is perhaps worth re-emphasizing here that the test results
brought out in any set of charts constitute only raw facts, not
meaningful interpretations or conclusions. These interpretations must
be supplied by the educator, who is acquainted with the circumstances
surrounding the test scores. Thus, in our illustration the
interpretation of the arithmetic achievement in the second and third
grades would be vastly different for a school that had given little
formal instruction in those grades than for one that had emphasized
arithmetic. It is easy to give a false impression of the significance
of test results if meaningful background facts are not taken into
account.

Fig. 16.3. Median arithmetic computation scores by grade for school W
compared with national norms and grade intelligence level. [Figure not
reproduced. Bar chart by grade; horizontal axis: grade equivalent,
2.0 to 8.0; bars for arithmetic computation (school W), arithmetic
computation (national norms), and intelligence level (school W).]


REPORTING TEST RESULTS FOR THE INDIVIDUAL 

The teacher, guidance worker, or principal is faced at frequent
intervals with the problem of reporting test results to pupils or to
parents. He must decide how much to report and in what way to
report it. In particular, the question is often raised whether exact
score equivalents should be reported. Should the parent be told
his child's I.Q. or reading grade level? Should the high-school pupil
see his percentile rank on a scholastic aptitude test?

We can perhaps best answer this question by asking another one. 
What information does the parent or pupil need in order to have a 
true picture of the situation or to reach a sound decision? In prac- 
tically every instance we will have to conclude that an exact score is 
not needed, and would be of no practical value. The need is for a 
sound interpretation of the pupil's or client's standing. Except in 
rare instances, this interpretation is better provided by school per- 
sonnel than by parent or pupil. The need is for an interpretation of 
the score in terms of its educational or vocational significance. 

In addition to not needing specific information there is the very 
real possibility that the lay person will misinterpret and misuse it. 
A precision may be attributed to test scores that they do not have. 
Differences may be noted and given significance that they do not in 
reality possess. The importance and scope of the test score may be 
overvalued. Invidious comparisons may be encouraged. 

Generally, then, results of testing a particular individual should 
be reported to the lay public, and especially to the parent or pupil, 
by statements of general level or range of ability: “about average,”
“somewhat below most of those in his group,” “not as well as we
would expect in terms of what we know about his abilities.” Emphasis
will be on an interpretive reporting of results, on what the scores
signify in terms of school progress or vocational plans. Exact
scores will be played down. Presented in this way, the potential
misuses of test results are held to a minimum, while their values in
providing understanding of the pupil are maintained.

There are three qualities that we would like to see in any inter- 
pretation of a score to a pupil or parent. 


1. The interpretation should be set in the frame of reference of 
the particular pupil. Thus, standardized achievement test results 
should be interpreted in terms of what is known about the pupil's 
aptitude and about his educational and vocational goals. A reading 
score at the 50th percentile of a ninth-grade group means quite dif- 
ferent things for the boy who aspires to go to medical school and for 
the boy who plans to become a mechanic. 

2. The interpretation should be directed toward positive and con- 
structive action. It should emphasize the assets in an aptitude 
profile, or it should be oriented toward remedial action when achieve- 
ment falls below what aptitude would lead one to expect. It should 
point toward possible educational or vocational outlets, even if the 
chosen one may be impossible of achievement. 




3. It should be factual and dispassionate, rather than appearing 
to pass judgment on the individual. Test results and other evidence 
should be reported truthfully and accurately, but with a friendly and 
accepting attitude. The flavor should be one of working with the 
pupil to realize common goals rather than one of passing judgment 
on him. 


Consideration should be given to making more general provision 
for feeding back to pupils information about their performance on 
tests that they have taken. All too often, pupils take a battery of 
tests and that is the last that they hear of them. Testing will be 
more meaningful and significant to pupils if they can see something 
of the outcomes and uses of the testing. Of course, any report must 
be adapted to the maturity of the group. However, several ways 
of reporting back to the group merit consideration. 


1. The classroom or homeroom teacher might give a simple sum- 
mary report and interpretation of the performance of the group as a 
whole. Strengths and weaknesses could be indicated. Comparisons 
could be made with general norms for the grade level. Some indica- 
tion could be given of the way the results could help the work of the 
group and planning for particular individuals in the group. 

2. An invitation could be extended for pupils to talk over the 
tests with the teacher. In individual conference, pupils could be 
given a general picture of the significance of their test results and 
how they relate to plans for the pupil. 

3. In cases where the test results have important implications,
i.e., large discrepancy between aptitude and achievement, large
discrepancy between test performance and what the teacher believes
the child to hold as a self-picture, the teacher could take the
initiative and make a point of discussing test results individually
with the pupil and perhaps also with his parents.


SUMMARY STATEMENT 


Our central theme in this chapter has been that tests are given
to be used, and that uses are determined by local needs. A testing
program must, therefore, be developed for local needs in terms of
local resources and with active participation by the local personnel,
who will be the users and interpreters of the test results.

In Table 16.1 we have listed a wide variety of functions that tests
are often called upon to serve. The discussion of these functions
may guide a local group in defining their purposes in testing. In
terms of common needs, we proposed tentative priorities for different
types of tests. These are summarized in Table 16.2.

An effective testing program implies an effective system of rec- 
ords, not only for test scores but also for other facts about the pupil. 
This system also should be oriented toward use. The choice of items 
to be recorded, format of record, place of housing of records, and 
responsibility of maintaining them should be planned so that a maxi- 
mum of use will be made of the records. 

Finally, adequate procedures are needed by which the results of 
testing programs can reach interested publics. Group results need 
to be organized effectively in graphs and charts centering about the 
significant comparisons, and the whole organized into a report that 
can be presented to school board, PTA, and other interested lay 
groups. Presentation of results for individual pupils to pupils and 
parents should be facilitated. However, in this presentation results 
should be in general terms, and emphasis should be on an interpre- 
tive synthesis rather than on specific details. 


REFERENCE 


1. Traxler, A. E., Techniques of guidance, New York, Harper, 1945.


SUGGESTED ADDITIONAL READING 


Allen, Wendell C., Cumulative pupil records, New York, Bureau of
Publications, Teachers College, Columbia University, 1943.

Durost, Walter N., What constitutes a minimal testing program?, Educ.
psychol. Meas., Vol. 7, Spring, 1947, pp. 45-60.

..., Records and reports: observations, tests, and measurements,
Early childhood education, Forty-sixth Yearbook, National Society for
the Study of Education, 1947, pp. 281-313.

... and evaluation programs, Federal Security Agency, Office of
Education, Circular No. 320, Washington, D. C., U. S. Office of
Education.

Traxler, Arthur E., Techniques of guidance, New York, Harper, 1945,
Chapters 11 and 12.

Traxler, Arthur E., et al., Introduction to testing and the use of
test results in public schools, New York, Harper, 1953, Chapters 2, 3,
5, 8, and 9.

United States Office of Education, National Committee on Cumulative
Records, Handbook of cumulative records, Federal Security Agency,
Office of Education, Bulletin 1944, No. 5, Washington, United States
Government Printing Office, 1945.


QUESTIONS FOR DISCUSSION 


1. Suppose you are starting to teach a new sixth-grade group. Make a
list of the types of information you would like to have about the new
class. For which could standardized tests contribute the information
either wholly or in part?

2. Assume that you are on a testing committee to plan a testing program 
for your school system. What would you want to know as a basis for planning 
the program? What part should the teachers in the system have in the 
planning? 

3. What is the relationship between planning a testing program and plan- 
ning curriculum revision for a school? 

4. How sound and how generally true is the statement “standardized tests 
are less satisfactory in the content than in the skill areas"? What is the 
basis for your answer? 

5. In a school system where you are in charge of testing, funds make it
possible to give only one group intelligence test in the elementary school.
When would you give it? What are the reasons for your choice? When
would you give one test in the junior-senior high school?

6. What are the relative merits of giving a battery of achievement tests in 
the fall (October or November), as compared with giving them in the spring 
(April or May)? 

7. In an elementary-school class that you are teaching, an intelligence test 
and an achievement battery have been given in October. In what different 
ways might you as teacher use the test results with your class? 

8. Under what circumstances would it be appropriate to have a school 
mark depend in whole or in part upon standardized test results? 

9. What are the advantages and disadvantages of a uniform state-wide
testing program, such as the Iowa Every Pupil Tests or the New York
State Regents Examinations?

10. To what extent should standardized test results be reported to
parents? To elementary-school pupils? To secondary-school pupils? To
college students? If they are reported, in how much detail and how
precisely should they be reported?

11. What should be the role of aptitude and achievement tests in
forming classroom groups for instructional purposes? In forming
special groups of high or low deviates?

12. School A is experimenting with a new type of core program in the
secondary schools. Those who have set up the program wish to evaluate its
effectiveness. What place should standardized achievement tests such as the
Essential High School Content Battery or Iowa Tests of General Educational
Development have in such an evaluation? What cautions should be observed
in basing the evaluation on these?

13. School B makes a policy of using the Metropolitan Achievement Test
one year, the California Tests the next, and the Iowa Every Pupil Test
the next. What advantages do you see in this policy? What
disadvantages?

14. How would you describe the shift in emphasis in a standardized
testing program as one goes from the fourth grade to the twelfth
grade?

15. Study the cumulative record system in the school where you teach
or in a school near you. How satisfactory are the record forms? What
additional items of information would you like to see in the records?
How accessible are the records for use? How much are they used?


Chapter 17 
Marking and Reporting 


One educational phenomenon closely related to problems of meas- 
urement is that of grades and report cards. In this chapter we shall 
start out by examining the basic nature of school marks. Then 
we shall consider the functions that marks and reporting systems are 
supposed to serve and try to evaluate how well these functions are in 
fact served by various marking and reporting practices. Finally, 
we shall consider several technical problems relating to the assign- 
ment of marks. 

What we shall have to say in this chapter will not stem to any
large extent from experimental research. The problems involved 
are largely problems calling for thoughtful analysis of needs to be 
met and the resources available for meeting them. 


WHAT A MARK IS 


A mark or grade in school, college, or university is, in the last
analysis, a judgment of one person by another. The judgment may
sometimes be formulated rather intuitively and subjectively, as when
the teacher reads and evaluates various literary productions known
as essay tests, reacts in an unanalyzed way to classroom activities
and participation, and puts in a little credit for "effort" to leaven the
mixture. In other cases, the final appraisal may be arrived at quite
mechanically and objectively, as when the teacher sums certain
objective test scores with specified weights, arranges the totals in rank
order, and gives the top 10 per cent A. But judgment was involved
in the second case as truly as in the first. In the second case
judgments were made early in the game: (1) that objective tests provided
the appropriate basis for arriving at a rank order, (2) that certain
items represented suitable evidences of achievement, (3) that the
appropriate responses to these items were thus and so, and (4) that
for this group it would be appropriate to assign the label A to 10
per cent of the group.
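
The mechanical procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pupil labels, the scores, and the two sets of weights are invented, and the function names are our own. It shows that the choice of weights, which is a judgment made before any paper is scored, can decide who receives the A.

```python
from typing import Dict, List


def composite_scores(scores: Dict[str, List[float]],
                     weights: List[float]) -> Dict[str, float]:
    """Sum each pupil's objective test scores with the chosen weights."""
    return {
        pupil: sum(w * s for w, s in zip(weights, tests))
        for pupil, tests in scores.items()
    }


def top_fraction(composites: Dict[str, float], fraction: float) -> List[str]:
    """Arrange pupils in rank order and return the top `fraction` of them."""
    ranked = sorted(composites, key=composites.get, reverse=True)
    return ranked[:round(fraction * len(ranked))]


# Hypothetical class of ten pupils, each with scores on two objective tests.
SCORES = {
    "P01": [80, 70], "P02": [65, 90], "P03": [90, 60], "P04": [70, 75],
    "P05": [55, 85], "P06": [60, 60], "P07": [78, 82], "P08": [75, 65],
    "P09": [50, 55], "P10": [95, 45],
}

# Judgment 1 (assumed): the second test counts double.
a_if_second_doubled = top_fraction(composite_scores(SCORES, [1, 2]), 0.10)
# Judgment 2 (assumed): the first test counts double.
a_if_first_doubled = top_fraction(composite_scores(SCORES, [2, 1]), 0.10)

print("A with second test doubled:", a_if_second_doubled)
print("A with first test doubled: ", a_if_first_doubled)
```

With these invented figures the single A goes to a different pupil under each weighting, although every later step is carried out without any further exercise of judgment.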

We see, then, that the manner in which the final appraisals are 
reached differs. Judgment may enter primarily into the planning of 
objective test procedures and may be made for the class as a whole, 
or it may enter into the appraisal of performance by specific pupils. 
The types of evidence judged will vary from class to class, consisting 
in some instances largely of test papers, in others of tangible pupil 
products, and in still others of pupil activity and participation in 
the work of the group. The qualities appraised and the weight 
assigned to each may vary markedly from one instructor or class to 
another. But the basic fact of passing judgment is common. Any 
mark is a judgment of a student by a teacher. 

Furthermore, a mark is ordinarily a relative judgment. The rela- 
tivity of a score was brought out in detail at the beginning of Chap- 
ter 6, when we were discussing norms. It was pointed out then and 
illustrated with a pair of spelling tests that a test score of 75 per 
cent could represent a high score or a low score, depending upon the 
nature of the test and the relation of the test to the person or group 
being tested. This is equally true of marks. What does a mark of
E signify? In one system it may denote "excellent," whereas in
another it denotes "failure but subject to re-examination." Here
the difference lies in the structure and meanings of the two symbol
systems. However, in other cases the symbol system may be nominally
the same but actually different. Thus, B+ has quite a different
real significance for the course of instructor X who gives half
A's than for the course of instructor Y who gives one tenth A's.

The variation from teacher to teacher and place to place bears
witness to the fact that no fundamental anchor or reference point
exists for grades. The standard of reference may be the group itself
when grades are explicitly assigned in terms of proportion of the total
group, e.g., 20 per cent A's, 30 per cent B's, 40 per cent C's, and 10
per cent D's. The standard may be some shadowy inner picture the
instructor carries with him of the performance by previous classes
and of the accomplishment that may reasonably be expected from
individuals at this level of advancement. But in either case, the
standard is a relative one. There is no "sea level," no fixed standard
to which we can refer any individual's performance.
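
The relativity of a mark assigned by fixed proportions can be illustrated with a short Python sketch. The quotas are those named in the paragraph above; everything else (the two classes, the scores, the function name, and the rule for rounding group sizes) is an assumption made for the sake of the example. The same score of 75 earns an A in one class and a D in the other.

```python
from typing import Dict, List, Tuple

# Quotas from the text, as whole per cents: 20% A, 30% B, 40% C, 10% D.
QUOTAS: List[Tuple[str, int]] = [("A", 20), ("B", 30), ("C", 40), ("D", 10)]


def marks_by_quota(scores: Dict[str, float],
                   quotas: List[Tuple[str, int]] = QUOTAS) -> Dict[str, str]:
    """Rank the group and hand out marks by proportion of the group.

    Cut points are rounded to the nearest pupil (an assumed rule); tied
    scores keep their original order, which is arbitrary.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    marks: Dict[str, str] = {}
    start, cumulative = 0, 0
    for label, per_cent in quotas:
        cumulative += per_cent
        end = (cumulative * len(ranked) + 50) // 100
        for pupil in ranked[start:end]:
            marks[pupil] = label
        start = end
    return marks


# Two hypothetical classes of ten; each contains one pupil who scored 75.
CLASS_X = {f"X{i:02d}": s for i, s in
           enumerate([98, 95, 92, 90, 88, 85, 82, 80, 78, 75], start=1)}
CLASS_Y = {f"Y{i:02d}": s for i, s in
           enumerate([75, 72, 70, 68, 65, 62, 60, 58, 55, 50], start=1)}

marks_x = marks_by_quota(CLASS_X)
marks_y = marks_by_quota(CLASS_Y)

print("Score 75 in class X earns", marks_x["X10"])
print("Score 75 in class Y earns", marks_y["Y01"])
```

Nothing in the procedure refers to a standard outside the group; the mark records only where the pupil stands among those who happened to be graded with him.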

The school can never escape making these relative judgments
about pupils. Such judgments will always be necessary for guiding
the pupil in his school work, for understanding his personal trials
and tribulations, for helping him to plan his educational and
occupational future, and for cooperating with later schools and employers
in selecting the individuals who may most suitably be instructed or
employed. No, the schools cannot escape their responsibility for
making these judgments. On this we can agree. But we will find
agreement on little else.

The issues revolve around the following questions, taken singly or 
in combination: 


1. When should judgments be made?

2. In what form should judgments be recorded?

3. What factors should be covered?

4. On what sorts of evidence should the judgments be based?

5. Who should be responsible for making the judgments?

6. To whom should they be reported? Who is the appropriate
consumer of the information?


We shall start by seeing who has a need for the information that 
judgment of pupils by teachers is supposed to provide and shall 
try to consider the other questions in the context of the functions to 
be served. 


THE FUNCTIONS OF MARKING AND REPORTING 
SYSTEMS 


Who needs the information represented by the judgments that 
the school makes about a pupil? We can recognize five normal users 
of this information. These are (1) the pupil being evaluated, (2) 
the pupil's parents, (3) some school that the pupil will attend later, 
(4) a potential employer or similar outside community agency, and 
(5) the school itself. If we now try to determine what information 
each of these potential users needs, we should be better able to de- 
cide what sorts of judgments should be made and how they should 
be reported. 


INFORMATION NEEDED BY THE PUPIL BEING EVALUATED 


Why does the pupil need to be informed about the school's
appraisal of him? What information does he need? Information
conveyed to the pupil would appear to serve four functions: (1) motivation
of school work, (2) guidance of learning, (3) guidance of future
educational and vocational planning, and (4) guidance of personal
development. For each of these needs, what information should the
school give him, at what times, and in what forms?


Motivation. The motivations underlying work in school are manifold
and complex. They include satisfactions in participating with
the group, in exploring and investigating new fields, in succeeding,
in receiving social approval, in conforming and doing what is
expected, and many others. The negative incentives are also quite
real: frustration, failure, boredom, or rejection.


Testing, evaluation, marking, and systems of reporting enter into
pupil motivation in at least four related ways.

1. Through experiences of success and failure in the day-by-day
recitations, tests, and exercises of the school.

2. Through the awareness that others are appraising his work and
forming their opinions of him at least in part on the basis of that
work.

3. Through the awareness that his performance in whatever ways
it is appraised by the school is becoming a matter of permanent
record and part of his official past history.

4. Through the anticipated and actual impact of school reports
upon his world, the world of family and classmates.

Daily experiences of success and failure with the tasks he meets
in school represent the most direct and intrinsic motivating effect
of measurement procedures. The teacher sets the tasks by which the
pupil tries out his abilities and reflects back to the pupil his
evaluation of the pupil's performance. This teacher appraisal, made known
to the pupil through corrections on papers and through oral
comments, provides a continuing set of cues on success or progress that
motivate daily activities.

The importance of success and achievement as motives to sustain
interest and effort is indicated in both laboratory and clinical
research. The goal of the school should be, therefore, so to set the
daily tasks and the standard of expected performance for each pupil
as to provide a reasonable proportion of experiences of success. The
teacher has the responsibility of knowing the potentialities and past
performances of each child well enough so that he can adopt and help
the child to adopt standards of performance that are reasonable for
him. If the most effective motivation is to be maintained, the school
must set flexible standards adapted to individual pupils.

Maintaining the favorable opinion of others represents another
aspect of school motivation. The pupil values the good opinion of
adults who are important to him: his parents and, usually, his
teachers. At higher educational levels, he is aware that the
impression of him held by his instructors may be important when they are
later called upon as references to support his application for higher
education or a job. The motivation to be thought well of by
teachers operates no matter what formal reporting or marking procedure
is used.

Associated with an awareness that his reputation with his teachers 
is important is an awareness that his academic record may be im- 
portant: the record that follows him in transcripts through further 
education and out into the world of work. Motivation to leave a 
good record is another aspect of the picture. 

Periodic grades and report cards represent administrative devices 
to write into the record the evaluations that teachers are making of 
pupils and to transmit them to pupil and parents. What further 
value do these reports have in motivating student learning, over and 
above the awarenesses to which we have referred in the preceding 
paragraphs? 

One cannot be too sure. It is undoubtedly true that the prospect
of the report card stimulates some additional effort on school work
by some pupils, especially in the higher grades. How widespread,
how profound, and how effective this additional effort is is another
matter, however, and is a point on which we have little, if any,
direct information. Evidence on the point is largely indirect,
stemming from the experience of those schools that have done away with
the traditional forms of marks and marking. There is no indication
that these schools find the motivating of pupils a more serious
problem than do the more conventional and marks-oriented
schools.

Informal contacts between school and parent, between teacher or
counselor and pupil can serve to transmit the type of appraisal that
is commonly recorded in a mark. At the same time, an adequate
program of testing and cumulative records guarantees that the
pupil's performance is a matter of record. In this setting, periodic
marks may well be superfluous as motivating devices.

In addition to harboring some doubts as to the effectiveness of
marks as motivating devices, many educators would raise serious
questions as to the desirability of the motivation they provide. The
motivation of the marks given in the traditional school has been
criticized as being (1) individual and competitive, emphasizing
individual achievement and superiority at the expense of cooperation
and joint achievement, (2) extraneous and not related to any genuine
purposes or needs for learning things for their own sake, and (3) a
barrier to organizing the work of the school around genuine interests
and pupil needs.


We would conclude that frequent opportunities to test himself
against the tasks of the school program and prompt evaluation of
his performance are important and functional aspects of the
motivation of the pupil in school, but that as far as motivation is
concerned a periodic grade is a questionable addition to this regular
and immediate appraisal.

Guidance of Learning. Information about his successes and his
errors is important also for guiding the pupil's learning activities.
Here, again, it is prompt information that is important: prompt
and, preferably, specific. Research has shown repeatedly the value
of specific identification of pupil errors or deficiencies and of
practice directed toward replacing those errors with a correct
performance. In handwriting, practice on those letter combinations that
produce illegibility in his handwriting; in English, practice to
eliminate the punctuation or capitalization errors found in his writing;
in spelling, practice on the words he has misspelled represent efficient
learning procedures. Techniques for diagnosing individual
weaknesses promptly so that they may be identified and corrected
represent key tools in guiding the learning process. Testing serves this
function in proportion as the tests provide diagnostic cues and those
cues are fed back to the pupil to direct his learning activities.

As far as the guidance of learning is concerned, the summary
appraisals represented by periodic marks would appear to have no
function. They are removed in time from the actual learning
activities, global, and quite lacking in diagnostic value. Marks
cannot help to direct day-by-day learning activities in the school.

Guidance of Future Educational and Vocational Plans. To make
intelligent, realistic plans about his educational and vocational
future, the student needs, among other things, a realistic
understanding of his level of educational achievement. He needs this
information most crucially at times of decision and action. However,
decisions and plans crystallize gradually, so that all times are suitable
for reflecting to the individual a realistic picture of himself in
relation to his educational and vocational expectations.

Periodic marks, regularly and systematically reported, do
represent one way of reporting to the individual judgments about him
that are significant for his educational and vocational planning. The
relationship of judgments at an earlier level (e.g., high-school marks)
to judgments at a later stage (e.g., college marks) has been
demonstrated repeatedly. As the individual considers whether he should
plan to go to college, seek professional training, or aspire to a
professional job, evaluations of his current academic success represent
important guideposts. The question is how that academic
achievement can best be appraised and how the appraisal can best be
communicated to the student.

With respect to manner of appraisal, one possibility is to rely
upon the judgment of individual teachers, based on whatever types
of evidence they may see fit to use in arriving at their judgment.
Another possibility is to make the single inclusive judgment that a
particular standardized test or group of tests gives a sound appraisal
of prospects for success. Tests have the great advantage that the
score scale has a uniform meaning independent of the particular
teacher or the particular school. Teachers' judgments based on a
class setting have, perhaps, more in common with future teacher
judgments also based on class activities. However, there are data
showing that as far as statistical prediction is concerned, a thorough
achievement battery at the high-school level, such as the Iowa Tests
of General Educational Development, gives as good a prediction of
college success as the full four-year record of high-school grades.
Thus, at least through the high school, a prediction equivalent to
that provided by grades could be obtained from standardized test
results.

Reporting academic achievement to the pupil only through a
periodic report card is a rather unsatisfactory way of communicating
to him the significance of this evidence for his own plans. What
does a B average at Centerville High signify with respect to probable
success at State University or prospects of becoming an engineer?
Some translation or interpretation is needed, and the student
cannot be expected to make it for himself. The important thing as
far as the student is concerned is the meaning of his record rather
than the raw grades themselves. For a standardized test score, it
is even more true that we need to provide the pupil with an
interpretation rather than with the raw facts. Here many counselors
would make it their policy to report to the student only the
interpretation, feeling that the bare test scores would be without meaning
to the student. But for the guidance of the pupil, this meaning must
also be put into the teacher's evaluation of class performance. Grades
are at best raw materials for formulating educational and vocational
plans. The raw material needs to be digested and appraised
by someone with greater breadth of experience than the typical
pupil. Teachers and counselors should do at least informal follow-up
of their students to see what degree of success a B student tends to
have at the State University or what sorts of jobs the C students
appear to find and hold successfully.


Guidance of Personal Development. We learn how to live with
other people by the way other people react to our actions. What is
acceptable and desirable behavior is defined by our group. The
school also has a responsibility of reflecting to the individual pupil
its appraisal of his ways of acting, so that he may develop in socially
desirable directions. This reflecting is implicit in every contact that
school personnel have with the student. Every expression of
interest, approval, enthusiasm, or the reverse is a subtle guidepost
indicating the way the school community defines desirable behavior
patterns and appraises pupils. These appraisals can be formalized by
anecdotal records or by periodic use of rating scales or check lists
to provide a permanent record of the appraisal of the individual
pupil. It seems doubtful, however, whether such periodic appraisals
are likely to be really understood by the pupil or to have much
impact on him. The direct person-to-person contact in the school,
largely in day-to-day interaction but also in special conference as
that seems needed, is probably the way in which the school exerts
whatever effect it can upon personal growth and development.

Relation of Self-Evaluation to School Evaluation. Certain
progressive educators have contended that evaluation of the student should
be primarily self-evaluation. It is contended that the pupil should
set his own goals and should evaluate his own progress toward them.
There is a good deal to be said for this point of view as far as some
of the functions of evaluation are concerned. The pupil's own goals
certainly represent more intrinsic motivators than arbitrary and
external goals. Guidance of his learning activities can perhaps be
satisfactorily achieved by reference to the objectives he himself has
set up.

There are many settings, however, in which student self-appraisal
seems quite out of place. The usual pupil lacks the background to
judge whether his abilities and accomplishments justify planning for
college education or a professional career. Furthermore, he is an
interested party to those evaluations. We could hardly expect the
medical school applicant to be an unbiased judge of his own
suitability for admission, or the job applicant to give his projected
employer a completely dispassionate evaluation of his strengths and
weaknesses. Self-appraisal is not a substitute for school appraisal.

In summary, we have seen rather little use for marks or a report
card as far as informing the student of the school's appraisal of him
is concerned. Motivation and direction of learning can best be
served by the tasks and tests of the school itself. Guidance of personal
development is best provided in direct personal contact. Marks
may have a function in guiding educational and vocational
planning, but this function can also be served, and perhaps more
effectively, by the uniformly based information from standardized tests.
Either type of information needs to be interpreted to the pupil in
relation to particular educational or vocational objectives.


REPORTING SCHOOL APPRAISALS TO PARENTS 


Marks and report cards have had as one of their main functions 
that of maintaining communication between the school and the 
home. Effective communication between these two chief agencies 
concerned with the growth and development of the child is vital for 
his best development. The school needs an adequate picture of 
each child's home: the physical circumstances under which he lives,
the family constellation, the attitudes of parents toward him, and 
parental goals and aspirations for him. At the same time, if parents 
are to cooperate with the school and to work intelligently for the 
child's good, they need to know what the school is trying to accomplish,
what problems it is encountering in the case of their child,
and what plans and programs are reasonable for that child. 

Two-way communication of the type just described is an ideal
that is only likely to be achieved in an alert and enterprising school
and with cooperative parents. It does, however, define the goal.
What place do marks and a report card have in moving toward such 
a goal? 


Fundamentally, the objectives of communication between school
and home and home and school are that both agencies shall
understand the pupil better. Reports from the school to parents have as
their objective letting parents know how the pupil appears to school
personnel in the school setting, so that the parents may be better
able to

1. Accept, support, and strengthen the child as he meets the
problems of growing up.

2. Understand and cooperate in the school's program for the child.

3. Adopt realistic and constructive educational and vocational
goals for him.

In appraising any reporting system, we must ask how well it achieves
these objectives.

The Traditional Report Card. The traditional school grading
system and report card measure up rather poorly by these standards.
A parent is not encouraged to understand, support, and feel closer
to a child by a string of 75's and 80's or A's, B's, and C's. They
are coldly and impersonally evaluative. They provide no
interpretation of strengths and weaknesses, no relating of performance to
potentiality, no encouragement to "accentuate the positive." For
many children, the effect of the report of grades is to alienate parents
from their child, rather than to enlist them in his support. Failure
by parent standards, which are often quite unrealistic, is frequent.
The popular humorist has long recognized the role of the report
card as a destroyer of family morale, and though we cannot accept
caricature as a picture of life, we must recognize the core of truth
in this disruptive picture of the report card.

A report of numerical or letter grades may have some value in
helping the parents to adopt realistic goals for each child. If marks
are low, this may suggest that a limited objective be set, and vice
versa. But as in the case of the pupil's own definition of his
objectives, here also grades need interpretation. The typical parent has
little notion of what significance a particular set of grades at a
particular school level has for success at higher academic levels.
Furthermore, the appraisal of achievement as seen by the teacher needs
to be combined with other evidence from measures of scholastic
aptitude and from standardized achievement tests if the most
adequate appraisal is to be made of future academic prospects.

It must be admitted that the translation of marks and
standardized test scores into a prognosis of later achievement is not easy to
make, even by trained personnel with all the evidence available.
But the untrained parent given only a school report card is in a
much poorer position still to make a sound judgment. If the school
is to help the parent plan realistically for the child, it must provide
interpretations of his academic and vocational potential based on all
the available evidence. These interpretations cannot well be
conveyed by a report card.

We see, then, that the traditional report card serves rather
unsatisfactorily the needs of the school to communicate with parents.
It is a one-way message lacking the interpretation and the
orientation toward constructive attitudes and constructive action that are
needed in such a communication. But to be dissatisfied with marks
and report cards does not guarantee an easy solution to the problem
of communication with the home. Schools have tried many
experiments in their search for an ideal medium of communication without
finding any that is satisfactory from all points of view. We may
consider briefly some of the variations that have been tried.

Modification of the Number and Meaning of Marks. The traditional
report card, as we are using that term here, reports grades in subject


As goals are broken down into more and more specific outcomes, 
the number of such outcomes increases rapidly. A report may well 
call for 40 or 50 separate judgments of each child. This tends to
become burdensome for the teacher who must make them. It is 
both burdensome in clerical detail and exacting in the thoroughness 
of knowledge of the pupil that it calls for. It is also frequently con- 
fusing for the parent who receives the report. Particularly in the 
case of the less well-educated parent, the complexities of format and 
the volume of detail are likely to be overpowering. 

When a report form is based on a detailed analysis of objectives
and covers many specific points, the categories for evaluating each
point are usually quite broad and simple. In the illustration we gave
there were four categories of response, but the number is often less.
However, phrasing these alternatives so that they present clear and
significant evaluations of a child is a problem. This is true especially
when the attempt is made to take account of individual aptitude
and to appraise the pupil in terms of his potential achievement.
How, for example, do we use the category "capable of doing better"
in the set of four categories in our illustration? Do we use it for the
slow-learner who is not keeping up with the group? For the
above-average youngster who is only doing average work? Or for the very
able child who is doing better than most of the group but not as
well as we believe he could? Any mechanical evaluation system
that tries to take account of both accomplishment and potentiality
runs into some such difficulties.

All in all, an analytic check-list type of report form appears a
commendable move toward providing the parent detailed
information about the child. If it can be interpreted to parents so that they
will not feel confused and frustrated and if the burden for the teacher
can be distributed, it can represent a partial solution. In the
development of such a form, it is important to keep in mind not only
the types of appraisals it would be nice to have about each pupil
but also the types of observations that a teacher can be expected
to make. There is little point in burdening the form with
evaluations for which the teacher will have no adequate evidence.

Teacher-Parent Conference. From many points of view, the ideal
way of maintaining communication between the school and the home
is to have periodic conferences between parent and teacher,
homeroom teacher, counselor, or other school representative. Such
conferences may, of course, supplement other more formal
communications. On the other hand, the school may elect to rely entirely upon
this channel. The advantages are several:


1. Communication is flexible. It can be adapted to the needs of
the particular case. The teacher may emphasize whatever particular
point she wishes to get over for this particular pupil.

2. Interpretation is possible. The report can indicate not merely
isolated facts but interrelationships and evaluations of these facts.
Long-range plans can be considered in the light of evidence available
to date.

3. The communication is a two-way affair. The school cannot
only inform parents but learn from them. It is possible to bring
together the two views of the child and see how each illuminates the
other. Home facts can throw light on school facts, and vice versa.

4. Misunderstandings can be cleared up. If the parent appears not
to understand the school's message, it can be clarified by further
discussion.


The difficulties in relying on conferences as the chief or only
method of communicating with the home are chiefly practical ones.
They are factors of time and skill of teachers and cooperation by
parents.


1. Time. If a program of parent-teacher conferences is to
function really effectively, the teacher must have time for several
unhurried conferences with parents of each child during the course of the
year. Some planning for the conference is desirable so that the
teacher may have in mind the points to be discussed. Furthermore,
notes on the conference would be a useful item to have in the
permanent record folder for the child. A record of points discussed,
information obtained from the parent, or impressions of the parent and
home should be available for reference. This means that an
allowance of 2 or 3 hours of teacher time per pupil per year is not
excessive for this function. Some special provision to release teacher time
may need to be made if time is to be available for this purpose.

2. Skill. Carrying out effective conferences with parents calls for
special skills on the part of the teacher. A parent conference is
essentially a counseling situation. It involves skills in establishing
rapport, empathy (a sensitivity to the other individual's feelings
and point of view), and ability to adapt a communication to the
particular audience. These skills can be improved somewhat with
practice and training, but large individual differences in skill will
remain. The effectiveness of a conference system depends on the
personal security, sensitivity, and understanding of the teachers who
will hold the conferences with parents.


3. Parent cooperation. Some parents are eager to come to school 
and meet with their child's teacher or guidance counselor. Others 
are very hard to reach. The reasons are varied: both parents work- 
ing, inconvenience in leaving the family, lack of interest, or hostility 
toward the school, but the net result is that there will be parents 
who do not come. This problem has been reported to become more 
serious as time goes by. Cooperation for a first interview is rela- 
tively good. But as the novelty wears off and the parent gets no 
dramatic information from the interview, interest tends to wane. 
Fewer parents come back for second and third conferences. Since 
those who fail to come are often the ones with whom the school par- 
ticularly needs to communicate, it seems unlikely that interviews can 
serve as the only, or perhaps even as the basic, medium of communi- 
cation for most schools. If a school has a very active and devoted 
parent group, this procedure may work successfully, but in other 
schools conferences would appear to serve best as an important sup- 
plement to some other communication medium. 


Informal Teacher-to-Parent Letters. An informal letter has sometimes
been used to replace or supplement the conference with the
parent. The letter is probably less time-consuming, and the parents 
of each child can be reached more surely. 

Some of the same flexibility that seemed so desirable in the inter- 
view is still present in the letter. But many of the advantages of a 
face-to-face conference are lost. There is no opportunity for inter- 
change of ideas or clarification of misunderstandings. Negative or 
critical notes seem more absolute and threatening when they are
committed to paper. Furthermore, writing good letters to parents
is an art at which some teachers have limited skill. Phrasing a
letter that will convey the desired tone and message and elicit
understanding and cooperation is not easy. The task can also become
quite burdensome.

The burden of letter writing can be eased somewhat if it is spread
out in time. And for any reporting system we may well ask why
all the reports need to go out on the same date. There seems to be
no good reason other than tradition and administrative convenience.
Spreading reporting over a period of several weeks will serve to break
some of the traditional report card attitudes and will also ease the
burden on the teacher.


Summarizing, as we review the possibilities for communicating
with parents, the most desirable way of doing this appears to be by
face-to-face conference. Because of practical difficulties in scheduling
such conferences and getting parents to attend, some supplementary
form of communication will usually be needed. Provided
teachers can be given help in developing skills to prepare them and 
time to do the work, the best type of supplement is probably a per- 
sonal letter from teacher to parent. If this is not practical, some 
type of check list will permit a certain amount of fullness and flexi- 
bility of report. The most sterile form of report, as far as helping 
the parent to help the child, is probably the traditional form of re- 
port card. However, in secondary or higher education if no one 
person knows the student well enough to organize a more integrated 
and understanding report of his progress, we may have to be content 
with traditional report forms. No revising of forms will substitute 
for intimate acquaintance with the pupil. 


SCHOOL APPRAISALS FOR THE INFORMATION OF EDUCATIONAL
INSTITUTIONS AT HIGHER LEVELS

Colleges maintain records (in part, at least) for the benefit of
graduate and professional schools; high schools for the benefit of
colleges; and, presumably, elementary schools for the benefit of high
schools and kindergartens for the benefit of elementary schools. It
is entirely reasonable that an institution at a higher level expects
the schools from which it recruits its students to provide it with
appraisals of candidates for admission. It is reasonable that the
higher schools expect the lower ones to provide as valid indicators
of probable success in the later school as they can.

Since the criterion of success at the higher level is based in
considerable measure upon ability to master the more abstract and
demanding studies at that level, a good deal of emphasis can
legitimately be placed on aptitude for studying abstract subject matter.


The questions we face center largely around the most useful
information and the best way in which to convey it. Information
likely to be useful is information about scholastic aptitude and
scholastic achievement. For information about aptitude, there is
little competition with standardized tests of scholastic aptitude.
Subjective teacher judgments, based on a limited sample of cases,
are an inferior substitute for objective performance expressed in a
uniform way by reference to a common table of norms.

In the area of achievement, a case can be made both for informal
teacher appraisals and for standardized tests. Teacher appraisals
are more flexible and adaptable to a wide range of activities: to skills
of oral participation, laboratory and workshop activities, and skills
of composition and expression, to mention only a few. However,
here also the objectivity and uniformity of standardized tests in the
basic skill and content areas are strong arguments for their use at
least to supplement the teacher's own appraisals. Standardized
tests of reading skills, arithmetical and mathematical skills, and
knowledge of science and social studies have been found to predict
success in later education with accuracy that compares favorably
with teachers' grades.

It is hard to know how much detail with respect to academic 
achievement a higher educational institution can usefully use. Most 
studies of grades as predictors deal with an over-all measure of
academic success as a predictor. Some types of specialized schools
probably tend to weight certain areas more heavily than others
because those areas are more closely related to their curricula, e.g.,
mathematics and science by engineering schools, history and English
by law schools. For these special purposes, a report of success in
particular subjects appears to have some meaning. But in most
instances, what is needed is a summary appraisal of total performance,
i.e., percentile rank in class or some similar statistic.

It is possible for colleges to get along very happily with no report
of formal grades from secondary schools and with no prescription of
specific content for the secondary-school program. This was
demonstrated in the special Eight Year Study carried out under the
auspices of the Progressive Education Association. In this study
thirty secondary schools were exempted for the period of the
experiment from all formal course and credit requirements, and a group of
colleges agreed to accept graduates from these schools on the basis
of the principal's recommendation alone. Follow-up studies in
college indicated that the graduates did as well, on the average, as
students of comparable ability from other schools. The school's
recommendation represented an adequate basis for admission.

The schools in the Eight Year Study were well-staffed and
well-supported schools. They carried out extensive programs of
standardized testing. In addition, a number of special appraisal devices
were developed for and used in the study. Cumulative records were
unusually complete. Evaluation within this freer setting needed to
be more, rather than less, thorough than usual. But given this
careful and continuing evaluation by a competent staff, the
judgment of the school staff, rendered by the principal, constituted a
satisfactory basis for indicating to the colleges which pupils could
be expected to succeed with college work.

In conclusion, then, in order to select students for admission,
higher-level schools need the information that schools at the lower
levels can supply about the pupil. However, class grades are only
one form for such information. Standardized test scores have
certain advantages over grades, and a comprehensive appraisal by the
lower school is another possible substitute. Grades have value chiefly
in those cases in which adequate objective appraisals are not
available and serve as a reasonably satisfactory substitute for such
appraisals. Their accrediting function increases in importance as we
go up the educational ladder.


SCHOOL APPRAISALS FOR THE BENEFIT OF EMPLOYERS 


As a service both to the student and to society, the schools must
stand ready to provide information about students to potential
employers. The information that an employer believes he needs will
vary from case to case. Often the employing or placing agency will 
have its own recommendation form. The school administration or a 
particular staff member will fill this out as accurately and completely 
as available information permits. School records that permit an 
adequate and objective reply to such inquiries are to be desired. 
However, detailed information on specific phases of academic achievement
is not likely to be required in many cases. A summary appraisal
of level of performance will usually suffice. 

The employer will often want information about non-academic
aspects of the student. This is where the school is hardest pressed to
provide useful answers. Perhaps the most important resource is
good factual information about out-of-class activities: club
memberships, offices held, and other evidences of participation or
responsibility. If anecdotal records have been maintained, some judicious
use of these may supplement the record of activities. Beyond this,
for non-academic appraisals one is largely thrown back on ratings
by teachers and staff members who may still remember the student.
These possess the usual limitations that plague rating procedures,
with the additional factor of forgetting thrown in. Retrospective
ratings of a student who sat in the back row 5 years ago are not
likely to be models of precision.


SCHOOL APPRAISALS FOR THE SCHOOL'S OWN USE

To a considerable extent, a school's appraisals of its pupils are to
help the school itself to do a better job of teaching and guiding the
pupils. To understand Johnny, who is now in the sixth grade, it is
useful to know what Johnny was like in the fifth grade, the fourth
grade, and on back to the school's first contact with him. Whether
Johnny as a sixth grader is presenting serious problems or is getting
along very well, the present situation can be better understood if its
roots in the past are known.

Here again, a variety of appraisals are possible. For broad skill 
and content areas, use can be made of standardized tests. For 
classroom work as seen by the teacher, a written comment and
evaluation may become part of the cumulative record. Or a letter 
or number grade may be recorded. The class grade is a clerically 
convenient way of summarizing the judgment of the pupil's work. 
What it gains in compactness, convenience, and ease of manipulation 
it loses in vitality and descriptive detail. However, practical
administrative routines may require this convenient, compact single-
letter or single-number summary appraisal. 

As a matter of realistic fact, in secondary or higher education, the
instructor may not know the student well enough to give more than 
the most general appraisal. Classes may be large, contact with stu- 
dents slight, and evidences of student accomplishment limited to 
performance on tests, exercises, and papers. When the instructor 
knows the pupil only through limited samples of academic work, it 
is unrealistic to expect anything more than a summary global rating 
of achievement. In this case, a report in one of a few broad cate- 
gories is probably as much as we can realistically hope to get. The 
use of letter grades is then a frank acknowledgment of our limited
bases for appraisal of the student. 

If the school's records are to be most helpful to the school in deal- 
ing with the pupil, they will be (1) continuous and cumulative, 
(2) descriptive and diagnostic, (3) comprehensive, and (4) organized. 
Programs of school records were discussed in Chapter 16 in connec- 
tion with testing programs. In such a program, grades for subject- 
matter achievement will have at most a small place. 


TECHNICAL ASPECTS OF GRADING AND MARKING 
PROCEDURES 


In the previous section we have expressed a number of reservations
about the suitability of traditional grading and reporting practices
for the purposes for which they are used. However, in many
schools and for some time to come practical administrative
considerations may dictate that grades be given and reported. If this is
to be done, it should be done in as competent and workmanlike a
manner as possible. The present section will be devoted, therefore,
to improving the workmanship in assigning grades if the basic
decision to give them has been made. We shall give some consideration
to (1) the factors to be included in arriving at a mark, (2) the types
of evidence to be considered, (3) the weighting of components of
evidence, and (4) the final distribution of grades.


FACTORS TO BE INCLUDED IN A COURSE GRADE 

When a course grade consists of a single number or letter, what 
factors should be taken into account in arriving at the final evalua- 
tion? In particular, should the grade reported be specifically re- 
stricted to the level the student has reached in the outcomes that 
the instructor hopes to achieve in the course? Or should considera- 
tion also be given to the student's potentialities, to attitude and 
effort, to amount of improvement, to writing and speaking skills? 

The answer to this question can only be given in the light of a 
decision as to the functions which the grade is to serve. It is our 
conviction that a grade functions most adequately 


1. To help guide the student and his parents with respect to 
future educational plans. 

2. To help a school decide upon a pupil's readiness to enter cer- 
tain selective programs or courses, e.g., a college preparatory pro- 
gram or a special mathematics or science course. 

3. To help higher educational levels to appraise an applicant's 
acceptability for the program they present. 

4. To help a potential employer to decide upon the suitability of 
the student for certain jobs that depend upon academic skills. 


These purposes would appear to be served best by a grade that is 
as pure and uncontaminated a measure of competence in the field 
as can be obtained. 

A grade that is a pure measure of competence has a unity of 
meaning that makes it interpretable and usable. The grade in 
algebra expresses competence in algebra. The algebra teacher may 
know other equally important things about the pupil: that he is 
highly intelligent, that he is disinclined to work on class assignments, 
that his written work is barely legible, and that he speaks and writes
ungrammatically. It may be desirable that these impressions also
be a matter of record. But they should not be concealed within the
algebra grade. A grade that is based in unknown proportions upon
competence, effort, attitude, and collateral skills is an
uninterpretable and unusable hodgepodge. A grade is not a tool of
discipline to be used as a reward or punishment, but a record of
achievement. If a single grade is to be reported, it should represent a single
thing, the best available estimate of competence in the field of
instruction.

What is meant by competence in the field is something each 
teacher or group of teachers must determine. We saw in Chapter 3 
that the first step in preparing a test in any field is that of defining 
the objectives of instruction and translating those objectives into 
behaviors that can be tested. In the same way, the first step in 
deciding what should be represented in a course grade is to define 
the several objectives of the course and decide upon the relative 
emphasis to be given to each. What weight is to be given to acquisition
of information? To understanding of concepts and generalizations?
To ability to apply information in solving problems? To
ability to organize and express ideas? To laboratory or manipulative
skills? Though in any course the grade should constitute an
appraisal of competence, the specific elements going to make up
competence must be identified specifically for each course.


TYPES OF EVIDENCE AND THE WEIGHTING OF EACH 


Once the instructor has defined his objectives and decided upon 
the weighting he wishes to give to each in his over-all evaluation of 
competence, he must determine what evidence he wishes to gather 
to represent each objective and how the different types of evidence 
are to be weighted in his composite appraisal. He must allocate 
weights as between a comprehensive terminal examination; exer- 
cises, quizzes, and partial examinations throughout the course; papers 
and reports prepared out of class by the student; participation in 
classroom discussion and other classroom activities; and performance 
observed or tested in laboratory or workshop activities. 

Terminal Examination. Those who would give major weight to a
final examination at the conclusion of a course argue that the
significant appraisal of the individual's competence in a segment of
instruction is what he can do at the end of that instruction or, even
better, after some lapse of time. When and how the individual
achieved his competence is deemed of no importance. Learnings are
considered of no value unless they persist at least to the end of the
course. Terminal ability is deemed the important ability.

The University of Chicago has carried this viewpoint through to
its logical conclusion and certifies achievement in the major areas
covered in the first two years of college entirely by comprehensive
examinations. Comprehensive certifying examinations also play a
central role in evaluation of students in many programs of graduate
training. State-wide testing programs, such as the New York State
Regents Examinations, are based on this same point of view that
final level of ability is the crucial appraisal.

As far as evaluation of the pupil is concerned, it seems hard to 
object to the point of view that final competence is the important 
competence. However, we will find less disposition to accept a ter- 
minal comprehensive examination as the only, or perhaps even the 
primary, basis for determining a final grade. Objections are on
several grounds.


1. It is impossible to appraise certain types of competence within 
the limits of a scheduled examination. Ability to find and organize 
materials in relation to a problem, ability to demonstrate certain 
skills—whether of using a microscope or of baking a cake—and 
ability to participate effectively in a group discussion or group project
are examples of outcomes not adapted to appraisal in a scheduled
examination.

2. The sample of behavior that can be obtained in an examination
of a practical length is limited, and the reliability of the
appraisal will be correspondingly restricted. Including evidence
available from other sources may permit a more reliable appraisal. This
will be true if the additional evidence is of as high quality as that
provided by the examination. However, both quantity and quality
of evidence must be borne in mind if reliability of appraisal is to be
a maximum.

3. A sample so limited in time may do injustice to certain
individuals. Certain examinees may be ill, tired, under pressure from
outside circumstances, or below par for other reasons at the time of
the examination. Their performance at a particular day and hour
may fail to represent their usual level of performance.

4. Performance under examination pressure may fail to represent
the individual's competence under more relaxed and normal life
conditions. An examination is inevitably a somewhat stressful
situation. The stress is heightened in the case of a single major
examination, the outcome of which has important effects upon the
individual's future. People differ in the extent to which they are
disturbed by such stress. Though a case can be made for basing
an appraisal in part on ability to perform under pressure, it is
hard to justify basing it completely on performance under stress
conditions.

5. The crucial terminal examination may have an unwholesome
effect upon teaching and learning activities during the year. At
best, the correspondence between what it is possible to test in an




take in class projects or discussions? How astute are his comments? 
How deep is his apparent understanding of what is discussed? 

Whether such evidence should have any appreciable role in a 
course grade seems more debatable. Nature and extent of class 
participation is perhaps more a matter of temperament and attitude 
than of competence. And those skills that are represented should 
perhaps better be evaluated for their own sake, rather than be in- 
corporated in an over-all mark. To evaluate an individual's knowl- 
edge and understanding by listening to what he has to say in a 
discussion is not easy. Unless skills of working in a group represent 
one of the specific objectives of instruction, it seems doubtful that 
evaluation of oral contributions in class or group should enter into a 
class grade. They will probably only have the effect of watering 
down our measure of competence. 

Laboratory and Workshop Activities. In many areas of school 
work, competence is represented at least in part by what the indi- 
vidual can do with concrete objects or materials of one sort or an- 
other. Can the student saw a board straight? Make a buttonhole? 
Cook an edible cake? Machine a part to specified tolerances? Set 
up a microscope and prepare materials for observation with it? 
Such skills as these must usually be tested in the laboratory or
workshop setting. In certain fields, this type of evaluation may
appropriately receive substantial weight in a total grade.

Evaluation of performance skills by unaided subjective judgment 
is likely to be quite unreliable. There are two approaches that can 
sometimes be used to improve the accuracy of appraisal of such skills. 


1. A standard task may be set up, a scale of examples of different
levels of goodness be prepared, and individual performance be judged
by comparison with the set of standard samples. This type of
product scale was discussed briefly in Chapter 11.

2. A detailed check list or score card may be prepared to cover
the performance or product. The check list can then be used to
evaluate an individual's performance as it occurs. Thus, we can
check step by step as he sets up work on a lathe to see that each
operation is done correctly and in proper sequence. A scoring
schedule can apply penalties for errors and omissions. Or a score card
can be applied to the product. Thus, in evaluating a cake baked
by a student, point allocations and scoring standards could be set
up for texture, flavor, consistency of frosting, and appearance.
Analytical scoring schemes and uniform reference standards will help
to improve reliability of judgments that are at best rather subjective.
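
The score-card idea can be sketched in a few lines of Python. The four
criteria are the ones named above for the cake; the point allocations and
the judge's ratings are invented for illustration and are not taken from
any published scale.

```python
# Hypothetical score card for a baked cake. The four criteria come from the
# text; the maximum points for each are assumed for illustration only.
SCORE_CARD = {
    "texture": 30,
    "flavor": 30,
    "consistency of frosting": 20,
    "appearance": 20,
}

def score_product(ratings):
    """Total a judge's ratings, checking each against the card's maximum."""
    total = 0
    for criterion, max_points in SCORE_CARD.items():
        points = ratings[criterion]  # a missing criterion raises KeyError
        if not 0 <= points <= max_points:
            raise ValueError(f"{criterion}: {points} is outside 0..{max_points}")
        total += points
    return total

# One judge's ratings of one cake (invented numbers).
cake = {
    "texture": 24,
    "flavor": 27,
    "consistency of frosting": 15,
    "appearance": 18,
}
cake_total = score_product(cake)
print(cake_total, "out of", sum(SCORE_CARD.values()))
```

Because every judge must rate the same criteria on the same scales, two
judges' totals can at least be compared item by item when they disagree.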




THE MECHANICS OF WEIGHTING PARTS IN A TOTAL 

Let us assume that we have decided upon the weight we wish to 
allocate to each of several evidences of pupil competence. For ex- 
ample, suppose we decide that we wish to assign one fourth of the 
total weight to laboratory work, one fourth to a series of short
quizzes, and one half to the sum of points earned on a midterm and
final examination. Our problem is to combine these three scores in
such a way that this weighting is in fact achieved.

Achieving the desired weighting becomes a problem because the
separate components may differ widely in variability. Suppose, in
the illustration we have given, the three components had the
following statistical characteristics.


                    Range of Scores    Standard Deviation
Laboratory work        63 to 86               4.2
Quizzes                32 to 147             21.0
Exams                  48 to 105             10.5


Suppose now we combine these three scores with weights of 1, 1,
and 2. What will happen?

Let us concentrate for the moment on the laboratory work and
quizzes. Suppose that the man who is best in one is worst in the
other, and vice versa. If we combine them with equal weight, we
find that the sum for the man who is at the top in quizzes and at
the bottom in laboratory work is 63 + 147 = 210, whereas the sum
for the man at the top in laboratory work and the bottom in quizzes
is 86 + 32 = 118. There is a gross difference. This is due to the
much greater variability of quiz scores, a variability that is in this
instance 5 times as great as that for lab work. The more variable
quiz grades have swamped out the less variable lab grades and
affect the final composite 5 times as much.
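
The arithmetic above can be checked with a short Python sketch. The scores
are the extremes from the table of ranges; the variable and function names
are ours, chosen only for illustration.

```python
# Extreme scores taken from the table of ranges above.
lab_low, lab_high = 63, 86
quiz_low, quiz_high = 32, 147

def raw_total(lab, quiz):
    """Equal nominal weights: simply add the two raw scores."""
    return lab + quiz

top_quiz_bottom_lab = raw_total(lab_low, quiz_high)   # 63 + 147
top_lab_bottom_quiz = raw_total(lab_high, quiz_low)   # 86 + 32
print(top_quiz_bottom_lab, top_lab_bottom_quiz)       # 210 118

# Ratio of the two standard deviations (21.0 for quizzes, 4.2 for lab work).
sd_ratio = 21.0 / 4.2
print(round(sd_ratio, 1))                             # 5.0
```

The two students are mirror images of each other, yet their raw totals
differ by 92 points, all of it in favor of the one who excelled on the more
variable component.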
The point we are making is that the effective weight of a component
in the total depends not only on its nominal weight but also upon
its variability. If the variability is great, the weight applied to a
raw score must be reduced if the component is to have the desired
effective weight in the composite. In the example we have given,
in order for the effective weights to stand in the ratio 1-1-2, the
weights applied to raw scores would have to stand in the ratio
1-1/5-4/5. Thus, if we multiplied the quiz grades by 1/5, the two
cases we considered above would have totals of 63 + (1/5)(147) = 92.4
and 86 + (1/5)(32) = 92.4 and would turn out approximately equal.
In determining the relative weights to be applied to the raw
scores on any component of a score composite, the desired effective
weights must be divided by a measure (or some rough estimate) of
the variability of each of the components. For this purpose, even
the range of scores can be used as an approximate estimate.
However, the semi-interquartile range or standard deviation is a better
estimate. In our illustration, the desired effective weights were
1, 1, and 2, and the standard deviations are 4.2, 21.0, and 10.5.
Dividing, we get 1/4.2 = 0.24; 1/21.0 = 0.048; 2/10.5 = 0.19.
These stand in the ratio 1-1/5-4/5 or 5-1-4. One can go badly astray
if one takes scores at their face value and weights them without any
attention to their variability.
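
The rule can be sketched in Python as follows. The standard deviations and
desired weights are those of the illustration; the function name
raw_score_multipliers, and the choice to rescale so that laboratory work
has a multiplier of 1, are our own conventions for the sketch.

```python
def raw_score_multipliers(desired_weights, std_devs):
    """Divide each desired effective weight by the component's variability,
    then rescale so the first component has a multiplier of 1."""
    raw = [w / sd for w, sd in zip(desired_weights, std_devs)]
    return [r / raw[0] for r in raw]

# Order of components: laboratory work, quizzes, examinations.
desired = [1, 1, 2]
sds = [4.2, 21.0, 10.5]

multipliers = raw_score_multipliers(desired, sds)
print([round(m, 2) for m in multipliers])   # [1.0, 0.2, 0.8]

# Re-check the two extreme cases, using laboratory work and quizzes only.
lab_m, quiz_m, _ = multipliers
case_a = lab_m * 63 + quiz_m * 147   # bottom in lab work, top in quizzes
case_b = lab_m * 86 + quiz_m * 32    # top in lab work, bottom in quizzes
print(round(case_a, 1), round(case_b, 1))   # 92.4 92.4
```

Only the ratios among the multipliers matter. Multiplying all three by 5
gives the whole-number form 5-1-4 and leaves the rank order of the
composite scores unchanged.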


THE ASSIGNMENT OF GRADES 

Suppose we have now scored all the tests, exercises, papers, prod- 
ucts, observations, and other behaviors that seem significant to us 
as evidences of competence in a field. We have put the separate 
scores together into a single composite score, weighting the separate 
types of evidence in the way that seems most appropriate to us. 
We have one score for each individual, constituting our best judg- 
ment of that individual's competence. The scores rank the indi- 
viduals in the group from most to least competent, according to our 
definition of competence. Now, how shall we arrive at a grade for 
each student? How many shall get A, how many B, and so on? 

Underlying Considerations. Before we can attack this question, 
there are two fundamental facts of life that we must face. 


1. The composite score representing our pooled appraisal of
competence is a relative appraisal only, not an absolute one. It permits
a comparison of one individual with another and a judgment of more
or less. It does not permit a judgment of absolute amount or of level
of excellence by an absolute standard. The score shows relative
performance only in the group for which common evidence is
available. By intuitive judgment or by collateral evidence, we may have
some impression of how this group relates to other groups.
Inevitably, such judgment entered into the tasks we set for the group
or the standards by which we appraised performance. But this
reference to other groups or broader standards is fuzzy and
undependable at best.

2. The meaning we give to numbers, letters, words, or other
symbols as standards of excellence is a matter of completely arbitrary
convention. It is utterly futile to argue whether 10 per cent or 25
per cent of a student group should receive A's, as if there were some
eternal verity that determined this. Each teaching group must
define what the symbols they use are to mean. Furthermore, in view
of the relativity of our appraisals, in the last analysis the only
definition that can be defended is a definition expressed as a range of
percentiles in a defined group. Thus, A might be defined as "the top 10
per cent of a representative group of freshmen at W college," or as
"the top 25 per cent of the group of M.A. candidates." There are
no right or wrong definitions; there are only more and less socially
useful and expedient definitions.
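
A definition of this kind is easy to apply mechanically once the composite
scores have been ranked. The Python sketch below shows one way to do it.
The shares (top 10 per cent A, next 20 B, next 40 C, next 20 D, last 10 E),
the student labels, and the scores are all invented for illustration and
carry no recommendation.

```python
def assign_grades(scores, shares):
    """Assign letter grades by rank within the group.

    scores: dict mapping student -> composite score.
    shares: list of (grade, fraction of group), best grade first;
            the fractions are assumed to sum to 1.
    Tied scores keep the order in which students were listed, which is a
    simplification made for this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    start = 0
    cumulative = 0.0
    for grade, share in shares:
        cumulative += share
        end = round(cumulative * n)   # rank at which this grade stops
        for student in ranked[start:end]:
            grades[student] = grade
        start = end
    return grades

# Hypothetical definition of the grading categories.
shares = [("A", 0.10), ("B", 0.20), ("C", 0.40), ("D", 0.20), ("E", 0.10)]

# Hypothetical composite scores for a group of ten.
scores = {
    "S01": 92.4, "S02": 88.0, "S03": 85.5, "S04": 81.0, "S05": 79.5,
    "S06": 77.0, "S07": 74.5, "S08": 70.0, "S09": 66.5, "S10": 61.0,
}

grades = assign_grades(scores, shares)
print(grades)
```

Changing the shares changes every grade without touching a single score,
which is the sense in which the definition is a convention and not a
measurement.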


In measurement textbooks in the past, there has been a good deal
of talk about the normal curve and "grading by the curve." Much
of this has been nonsense. What we have is a rank order of pupils
and a rank order of evaluation categories. Where in the first rank
order we set the dividing lines of the second rank order is a
completely arbitrary decision and should be based on considerations of
practical utility. Suppose that E signifies “Unsatisfactory. Does
not receive credit. May take re-examination, with possibility of
raising to D.” The decision as to what per cent of a representative
group in a particular institution should be given E is not a
statistical decision. It is a practical administrative decision. To make
the decision realistically one would have to answer the question:
Considering the type of students whom we receive, the social
functions which our institution serves, and the implications for the
individual that attach to the grade E as we use it, to what per cent of
students does it seem appropriate that this grade be given? The
answer rests primarily upon the social and educational philosophy of
the institution or of a segment of it.
To assign grades rationally and consistently, we must


1. Explicitly recognize the arbitrary social judgment that is
implied in defining grading categories and make that judgment for our
school and college on the basis of full understanding and rational
analysis of the implications of our decision.

2. Establish general adherence to the definition among the
individual faculty members of the institution.

3. Devise techniques to assist the instructor in adapting and
applying the definition to each specific class.


Making the Meaning of Grades Explicit. Currently in most
institutions the definition of grading categories is either in terms of
verbally described levels of performance (e.g., A, superior; B, good;
C, satisfactory; etc.) or in terms of a completely inappropriate
assumption of absolutism in which 100 per cent is perfect, 75 is
three-fourths of perfect and is just acceptable, and so forth. The
interpretation of grades in terms of relative performance is ill defined
and highly individual from instructor to instructor. But the rela- 
tive meaning is basically there. It sneaks in as raised marks on tests, 
as make-up tests when the original one was too hard, as standards 
of grading written work, and in other guises. This relative standard 
should be brought out into the light, closely and critically examined, 
and made clear and explicit for all. That is, some agreement should 
be reached as to what a grading symbol does mean in terms of stand- 
ing in relation to a standard reference group. A consensus must be 
sought within the institution that an A, for example, or a grade of 
90 or above signifies being in the top 10 per cent of typical general 
introductory courses and the top 15 per cent of advanced courses. 
If there are legitimate reasons for different standards in different 
fields or special courses, these should be made explicit. Thus, the 
social good might be served by having different standards for a 
freshman course in physics and one in public speaking. Making 
the general meanings and the exceptions to them explicit is the first 
step in achieving uniformity of meaning. 

Achieving Uniformity among Instructors. Even when an explicit 
definition is achieved, instructors will still show marked variations 
in their application of the definition to their own class groups. 
Though an agreement may have been reached to define A as “top 
10 per cent of a typical freshman group," some instructors of fresh- 
men will continue consistently to give 5 per cent of A’s and others 
to give 25 per cent. These remaining differences can be attacked by 
a continuous program of education and publicity. Grade distribu- 
tions may be prepared by each instructor and reviewed within the 
department. The distributions for different departments within the 
institution may be assembled and reported to the faculty. The 
discrepancies of individual instructors may be pointed out to them 
and may receive administrative review in extreme cases. Even 
when this is done, however, there are likely to be a few intransigent 
non-conformists who will refuse to accept the institution's definition 
of grading symbols. One suspects that in certain cases grading tend- 
encies reflect rather deep-seated personality trends that are not 
easily overcome. 

Adjusting for the Nature of the Particular Class Group. Suppose 
the arbitrary and relative nature of the grading system has been 
recognized and a serious attempt is made to set up a system defined 
in terms of fractions of the total group and to use this system in a 
consistent and rational manner. Then the most troublesome tech-
nical problem is to take appropriate account of the nature of a class
group and of its deviation from the general reference group or groups 
used to define the grading categories. When individual class groups 
are small or when groups are formed taking account of ability, one 
cannot expect each group to represent the average defining group of 
"typical freshmen” or the like. Some procedure is needed to adjust 
for the specific nature of the group. Several suggestions will be 
offered. 


1. When there are several types of reference groups—e.g., under- 
classmen, upperclassmen, and graduate students or major students 
and non-major students, it may be desirable to have different 
definitions of the grading symbols apply to each and to make this 
clear. Thus, it might be decided that an underclass population of 
grades should include 10 per cent A's, an upperclass group 20 per 
cent, and a graduate group 25 per cent. The proportion to be ex- 
pected in a mixed group would be a weighted average of the separate 
group percentages. The rationale for such a differentiation would 
be the presumed selection of abler students at the higher levels.
Here again, however, whether it is desirable to redefine the meaning
of the symbols at successive levels is a matter of judgment to be
arrived at in terms of the purposes that the symbols are to
serve.
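
The weighted average mentioned in suggestion 1 can be sketched in a few lines of Python. The per-level percentages of A's (10, 20, and 25) are the ones given in the text; the class composition of 20 underclassmen, 10 upperclassmen, and 10 graduate students is a hypothetical figure chosen only for illustration, and the function name is likewise an assumption of this sketch.

```python
def expected_share(counts, shares):
    """Weighted average of per-group percentages, weighted by group size."""
    total = sum(counts.values())
    return sum(counts[group] * shares[group] for group in counts) / total


# Per cent of A's defined for each level (figures from the text).
A_SHARE = {"underclass": 10, "upperclass": 20, "graduate": 25}

# Hypothetical mixed class; these counts are not from the text.
mixed_class = {"underclass": 20, "upperclass": 10, "graduate": 10}

pct_a = expected_share(mixed_class, A_SHARE)
print(pct_a)  # 16.25 per cent of this mixed class would be expected to receive A
```

With a different mix of students the expected percentage shifts toward the figure for whichever level predominates.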

2. It will probably be desirable to let the defining population for
a subject area be the total population of pupils who take work in
the area. That is, chemistry grades should be assigned in terms of
the general population of students taking chemistry and industrial
arts grades in terms of the total group taking industrial arts. An
attempt to make allowance for differences in ability between the
two student populations does not seem socially desirable. It is
probably more useful, and certainly more convenient, to define
chemistry students in terms of a population of chemistry students
than of a population of students in general. This means that the
proportion of pupils assigned a particular grade should generally re-
main essentially the same from subject area to subject area.

3. For large general courses handled as a single group, possibly
groups of 50 or more, it will be most acceptable to use the total
class group as the defining population. That is, the specified pro-
portions in the different grading categories should be maintained
fairly rigidly for the group. This will introduce less error and dis-
tortion than an attempt to make necessarily subjective allowances
and corrections for differences between different groups.


4. When pupils in what is essentially a single course with common 
objectives and materials of instruction are broken up into separate 
sections for teaching, a nucleus of common testing can provide the 
anchor and basis for determining the number of pupils receiving each 
grade for each section. The common nucleus is most likely to be a 
common final examination, but it might include appraisals at other 
times, and it might conceivably be a standardized test. This com- 
mon core of measurement would not control the mark received by 
any single student, but would control the numbers receiving each
grade in each section of the course. 

The way grades are adjusted by this procedure may be illustrated 
by a specific example. Let us assume as given: 


1. A general definition of the grading symbols as follows:

A = top 15 per cent
B = next 25 per cent
C = next 40 per cent
D = next 15 per cent
E = bottom 5 per cent


2. Two class groups, receiving scores on a common final examination as
follows:

Score          Section 1   Section 2   Total
120 or over       10           2         12
100-119           11           9         20
75-99             14          18         32
45-74              4           8         12
Under 45           1           3          4
Total             40          40         80

(The numerical values have been grouped and arranged to simplify compu-
tation for this numerical example.)


We see that we have 80 individuals in all. Fifteen per cent of 80
is 12, the number of A's. The data in the illustration are already
grouped so that this represents the top score category. Thus, Sec-
tion 1 would receive 10 A's, whereas Section 2 would receive only 2.
The other groupings are also already set up to correspond to B, C,
D, and E by our definition. The particular students to receive A's
in either section would be selected on the basis of the total record
of evaluation the instructor was using, only the number being con-
trolled by the common examination.

If this procedure is used, it is most important that the common
test tap equally the objectives and materials stressed in the different
sections. The instrument should pool the ideas of all instructors and
be acceptable to each. Even when this is accomplished, it may be
desirable to temper somewhat the differences between groups, re- 
membering that the common test is only a part of the total appraisal 
and will not correlate perfectly with it. That is, in our illustrative 
example, we might rule that Section 1 could have 8 to 10 grades of 
A; Section 2 could have 2 to 4; and so forth. Final decisions could 
be based on the total appraisal of individual students within each 
group. 
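
The two-section illustration under suggestion 4 can be traced with a short Python sketch. It pools the common examination scores, ranks them, and counts how many of each grade fall to each section under the definition in which A is the top 15 per cent, B the next 25, C the next 40, D the next 15, and E the remainder. Because the text gives only grouped frequencies, each score band is represented here by a single stand-in score; those stand-in values and the function name are assumptions of this sketch, not part of the original example.

```python
# Cumulative grade definition: (grade, per cent of the pooled group).
GRADE_SHARES = [("A", 15), ("B", 25), ("C", 40), ("D", 15), ("E", 5)]

# (stand-in score for the band, Section 1 count, Section 2 count).
# Counts are from the illustration; the stand-in scores are hypothetical.
BANDS = [(125, 10, 2), (110, 11, 9), (85, 14, 18), (60, 4, 8), (40, 1, 3)]


def section_quotas(scores_by_section, grade_shares):
    """Count how many of each grade each section receives on the pooled ranking."""
    pooled = sorted(
        ((score, section)
         for section, scores in scores_by_section.items()
         for score in scores),
        key=lambda pair: pair[0],
        reverse=True,
    )
    n = len(pooled)
    quotas = {s: {g: 0 for g, _ in grade_shares} for s in scores_by_section}
    start, cumulative = 0, 0
    for grade, pct in grade_shares:
        cumulative += pct
        end = round(n * cumulative / 100)  # rank position where this grade stops
        for _, section in pooled[start:end]:
            quotas[section][grade] += 1
        start = end
    return quotas


scores = {"Section 1": [], "Section 2": []}
for stand_in, n1, n2 in BANDS:
    scores["Section 1"] += [stand_in] * n1
    scores["Section 2"] += [stand_in] * n2

quotas = section_quotas(scores, GRADE_SHARES)
print(quotas["Section 1"])  # {'A': 10, 'B': 11, 'C': 14, 'D': 4, 'E': 1}
print(quotas["Section 2"])  # {'A': 2, 'B': 9, 'C': 18, 'D': 8, 'E': 3}
```

The sketch fixes only the number of each grade per section. Which students receive them, and any tempering of the counts such as allowing Section 1 between 8 and 10 A's, remains with the instructor's total appraisal, as the text describes.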

5. When only a small group takes a particular course, it is prob- 
ably unsafe to allocate grades mechanically, following the percentages 
specified for a group at that level. One way of anchoring such small 
groups to a common standard is to use some common evaluation 
instrument with groups in successive semesters and compare each
new group with the accumulated evidence on groups that have gone 
before. The number of A's, B's, etc., may be adjusted to take 
account of the level of performance on the common test, a class 
that does exceptionally well being given an increased number of 
high grades and vice versa. This procedure encounters two difficul- 
ties: (1) some pupils in later groups may get advance information 
about the anchor test and (2) the anchor test may become inappro- 
priate for later groups. If these difficulties can be overcome, the 
procedure is a promising one. 

Where anchoring groups to earlier populations by a common
measure seems impractical, as it would for any course in which an
objective test is not an appropriate basis for evaluation, one is
thrown back upon the judgment of the teacher. There is probably
a kernel of accuracy in the instructor's impression that a particular
class group is better than or worse than groups taught in a course
in previous years, though the judgment may be as responsive to the
instructor's sciatica as to the class's stupidity. So some tempering
of the usual numbers at each grade level may be appropriate in such
a case. However, watch must be kept to make sure that such tem-
pering does not become habitual with the instructor in question.

A grading system in an educational institution is a deeply ingrained
part of the educational culture pattern. It is usually accepted auto-
matically and with no more critical thought than our habits of hold-
ing a knife and fork. The new teacher is not systematically instructed
in grading procedures but grows into them as a child grows into the
regional pronunciation of “water.” It seems unfortunate that our
educational evaluations should be treated in such a casual fashion.
It would pay the staff of any school to go through for themselves
the type of analysis that is represented in the previous paragraphs
and to make explicit for each teacher definite guides as to the sta-
tistical meaning of different grades and the procedures for arriving
at them.


SUMMARY STATEMENT 


We have now completed our scrutiny of marking and reporting 
procedures. Clearly, a number of different issues are involved, and 
no simple solution appears to be at hand for any of them. This is 
because there are so many different purposes that the school's eval-
uations of the individual are designed to serve. Functions of motiva-
tion, guidance, and certification are all significant functions, but
they tend to conflict with one another. The problems of marking 
and reporting take on quite different character at different educa- 
tional levels as different purposes come to the fore and different 
conditions prevail. 

In the elementary school, school evaluations serve primarily (1) to 
help motivate and guide the pupil’s learnings, (2) to inform parents, 
so that they can work closely with the school for the child's good, 
and (3) to provide, for the school records, background material for 
understanding the child's later development. The first of these 
functions appears best served by the school activities themselves,
together with the immediate appraisal which teacher and pupil
make of them. For the second we would prefer informal and de-
scriptive reporting to the parent, preferably in face-to-face con-
ference. For the third, descriptive teacher appraisals can advan-
tageously be supplemented by standardized test results. In the ele-
mentary school, the classroom teacher's contacts with his group are
sufficiently extensive so that it is reasonable to expect appraisals 
that go beyond competence in subject matter. 

In secondary education, other purposes take the spotlight in the 
school's evaluations of its pupils. Evaluations function as (1) a
major element in defining future educational and vocational goals
for each student and (2) evidence to be used by colleges and em-
ployers in admitting students to the goals they have chosen. At
the same time with departmentalized instruction teacher-pupil con-
tacts become less intimate. Appraisals of the pupil as a person
become more difficult and less satisfactory. In this setting, the
typical school may find it best to fall back on the traditional system
of letter grades as a basis for appraising competence in academic
skills that are important for educational progress and vocational
guidance. However, cumulative records providing supplementary
evidence on (1) aptitude and achievement measured by standardized
tests, (2) out-of-class activities, and (3) observations of personal de-
velopment are important supplements.

Whatever types of appraisal are undertaken, it is important that 
they be interpreted to pupil and to parents. Only then can they 
serve their guidance function. Some further aspects of this inter- 
preting process are considered in the next chapter. 

At the college level, the selective and certifying function of grades
moves even more into the center of the stage. Unless comprehensive
examinations or such standardized appraisals as the Graduate Rec-
ord Examination come into more general use, grades will continue
to be required in most institutions to serve these administrative
functions.

If grades are to be given, they should be handled in a competent
and consistent fashion. For the functions that it serves most appro-
priately, a grade should represent as pure a measure of competence
in a field as can be prepared. It should have consistent meaning
from instructor to instructor. To achieve such comparable ap-
praisals of competence, the following steps are necessary:

1. Define the knowledges and skills that constitute competence in
a field and decide what weight should be given to each.

2. Decide what types of evidence will be accepted as evidence of
this competence, determine what effective weight should be given
to each component, and handle the weighting of raw scores so that
the desired weighting is in fact achieved.

3. Reach a negotiated agreement on the statistical meaning of the
grading symbols in terms of percentiles of a defined group or groups.

4. Work out procedures for adapting the definition to small or
atypical class groups.

In conclusion, a marking or reporting system should not be taken 
for granted. Every now and then the individual instructor or a
faculty group should ask: Why do we go through this periodic agony
of marks or marking? What ends are served? How might they be
served better?

REFERENCES 


1. Chamberlin, D., et al., Did they succeed in college?, New York, Harper,
1942.

2. Lindquist, E. F., Iowa Tests of General Educational Development: Manual,
Chicago, Ill., Science Research Associates, 1951.


SUGGESTED ADDITIONAL READING 


Elsbree, Willard S., Pupil progress in the elementary school, New York, 
Bureau of Publications, Teachers College, Columbia University, 1943. 

Monroe, Walter S., Editor, Encyclopedia of educational research, rev. ed., 
New York, Macmillan, 1950, pp. 711-717. 

Smith, Eugene R., et al., Appraising and recording student progress, New 
York, Harper, 1942, Chapters 9-11. 

Strang, Ruth, Reporting to parents, New York, Bureau of Publications, 
Teachers College, Columbia University, 1947. 

Wrinkle, William L., Improving marking and reporting practices in elemen-
tary and secondary schools, New York, Rinehart, 1947.


QUESTIONS FOR DISCUSSION 


1. Do you agree with the authors that every mark is a relative judgment? 
If not, in what cases and on what grounds do you disagree? 

2. In what ways is the marking system in a school similar to a rating pro- 
cedure? In what ways does it differ? What factors that limit the effective- 
ness of ratings also limit the effectiveness of a marking system? How could 
the suggestions for improving ratings given in Chapter 13 be used to improve 
marking procedures in a school? 

3. How is the general level of ability of the class that a student is in likely
to affect the marks he will get?

4. What should be the role of student self-appraisal in evaluating educa-
tional progress? What are the limits of such appraisal?

5. Try to get copies of the report cards used in one or more school systems.
Examine them, and compare them with the cards obtained by other class
members. What similarities and differences do you note? What shortcom-
ings do you feel they have?

6. Talk to a school principal or superintendent and find out what changes
have been made in reporting practices while he was in the school system.
Why were they made? How satisfied is he with the result? What provisions
are made for parent-teacher conferences? How satisfactorily have these
worked out? What problems have arisen? How well does the present sys-
tem of marking and reporting serve the four functions listed on p. 458?

7. What problems arise when one tries to have a report card take account
of aptitude?

8. Comment on the proposition: “A course grade is most useful when it
measures as accurately as possible the pupil's mastery of the direct objectives
of the course and is not messed up with any other factors."

9. For a course that you teach or plan to teach, list the types of evidence
you would plan to consider in arriving at a course grade. Indicate the weight
to be given to each. Why have you allocated the weights in this way?

10. You have decided to give equal weight, in a biology course, to (a) a
series of quizzes, (b) a final exam, and (c) laboratory grades. A study of the
score distributions shows that the quiz S.D. equals 10, the exam S.D. equals
15, and the laboratory S.D. equals 5. How must you weight the raw scores
to give the desired weight to the three components of the final grade?

11. When is it appropriate to “mark on a curve”? When not? When it is,
how should the fraction to get each grade be determined?

12. What steps would you propose to take to reduce differences between
instructors in grading standards?

13. In college Y there are ten sections of freshman English. What steps 
could be taken to assure uniform grading standards, so that a student is not 
penalized by being in a particular section? 

14. When schools are considering changing their grading system from a
percentage or letter grade to some other letter system based on the ability of
the individual student such as S = satisfactory; O = outstanding; N = needs
improvement; and U = unsatisfactory, the following arguments against the
new procedure are frequently brought up:

a. The level of achievement will be lowered because students will have no
motivation to work hard.

b. It will remove competition and competition is the basis of our society.

c. When a child gets out of school, he will work for a living and earn a
salary based on his competence on the job. Grades are the salaries of
school children and should be given on the same basis.

d. Marks based on individual ability give a child a wrong picture of him-
self and the parents of the child do not know exactly where their child
stands.

e. A child should learn to adjust to failure since he will experience failure
many times during his lifetime. If he has never failed in school, he does
not get this experience.

What are the merits of each of these arguments? What false assumptions
does each make?


Chapter 18


Measurement in Educational 
and Vocational Guidance 


Mr. Wilson, guidance counselor at Center High School, has a con-
ference scheduled with Walter Kay, a tenth-grade pupil. This is
the first conference. From the regular school testing program, Mr. 
Wilson has the following aptitude and interest test percentiles for 
Walter: 

Differential Aptitude Tests            Kuder Preference Record, Vocational

Verbal Reasoning              78       Scientific       62
Numerical Ability             74       Outdoor          80
Abstract Reasoning            65       Computational    46
Space Relations               82       Clerical         40
Mechanical Reasoning          80       Literary         30
Clerical Speed and Accuracy   48       Artistic         32
Language Usage: Spelling      72       Musical          45
Language Usage: Sentences     76       Persuasive       52
                                       Mechanical       75
                                       Social Service   48

Walter's course grades for the previous year placed him about 60th 
in the class of 200. Walter's father is a fairly successful local busi- 
ness man. Walter has indicated that he wants to become a doctor. 

What significance do these test results have for Walter's expressed 
vocational goal? Do they imply greater suitability of other voca-
tional goals? How are the results and their significance to be con-
veyed to Walter? These are not easy questions to answer, but suit-
able answers to them are at the heart of counseling. We must
examine them in some detail.


THE SIGNIFICANCE OF TEST SCORES FOR A VOCATIONAL GOAL


To judge the suitability of Walter's vocational goal in the light of
his test scores, we need to know what the chances are that someone
with that score pattern will reach that goal. This is a large order.
If we break it down into parts, maybe we will see why it is so over-
powering. In Walter's case we need to be able to estimate


1. The probability that he would be accepted as a student by a
college.

2. The probability that he would complete a premedical program
successfully, if accepted.

3. The probability that he would be accepted by a medical school
if he completed premedical training.

4. The probability that he would be graduated from medical
school, if accepted.

5. The probability that he would achieve minimum standards of
success and satisfaction as a doctor, if he were graduated from
medical school.


What sort of a judgment can be made with respect to the prob- 
ability of Walter successfully passing each of these hurdles? 

The first hurdle is getting into college. Since colleges are likely
to pay attention to standing in high-school class and to scholastic
aptitude, we should examine the evidence we have on these points.
Walter is high in the second quarter of the class in his high school
in scholarship. (Our present information does not tell us what level
of achievement this would represent in terms of broader norms.) A
pooling of the Verbal, Numerical, and Abstract scores on the DAT
comes close to representing scholastic aptitude and provides a fairly
good predictor of academic achievement. On these three parts he
has percentiles of 78, 74, and 65, averaging about the 75th percen-
tile. The aptitude tests and school achievement are in rather close
agreement, and we can feel fairly secure in the picture of ability
level that is provided us.


TESTS INTERPRETED IN TERMS OF EXPECTANCY 

Now what about getting into college? What are the chances that
a boy who falls at about the 70th to 75th percentile of a tenth-grade
group will get into college? Table 18.1 summarizes data from two
studies on the relationship between I.Q. in elementary and secondary
school and entrance into and graduation from western state univer-
sities. They provide a fairly realistic picture of likelihood of starting
and completing higher education.*

* Certain limitations of this table should be pointed out. In the first place, it is
based on pupils tested in the 1920's and going to college in the 1930's. Rates of
college attendance have increased somewhat since then, so the rates are probably
somewhat low. Furthermore, it shows the per cent of all children in the initial
group going to college. In many cases, non-attendance was probably due to lack
of interest or financial factors and not to inability to qualify for admission. Among
those who applied for admission, the rates would have been higher, especially at the
middle and upper I.Q. levels.

A table such as Table 18.1 is called an expectancy table. It shows
the expectancy of achieving a specific criterion of success (in this
case, admission to or graduation from a state university) at each
predictor score level. The counselor could with advantage use a
host of tables of this sort, relating the widely used predictor instru-
ments to a wide range of significant educational and vocational cri-
teria. Unfortunately, for the most part the tables do not exist in
any organized form. Any working counselor carries in his head
crude approximations to such tables, but the approximation is likely
to be very rough and may sometimes be badly biased.


Table 18.1. Probability of Entering and Being Graduated from College in
Relation to I.Q. While in School

               Chances in    Chances in
               100 of        100 of
I.Q. Level     Entering      Graduation
   150            92            60
   140            79            48
   130            64            35
   120            50            25
   110            36            16
   100            22             7
    90            11             2
    80             4             1
    70             1             0

Adapted from Adams and Keys.


We can roughly translate Walter's standing at about the 75th
percentile of the tenth grade as corresponding to an I.Q. of 110 to
115. Referring to Table 18.1, our best estimate from the evidence
at hand would seem to be that Walter has about 2 chances in 5 of
entering a college such as these and about 1 chance in 5 of being
graduated.

However, colleges differ very widely in their selectivity. The
chances of being accepted in an Ivy League liberal arts college would
be considerably less than the values we have shown, which are based
on state universities. The probability of acceptance by a junior
college, small teachers college, or local municipal college would be
in many instances substantially higher.

What of Walter's chances of acceptance by a medical school?
Again, the evidence we must put together is fragmentary. A study
in 1948² indicated that the number of persons applying for medical
training was at that time about 4 times the number of vacancies in
medical schools. The average applicant who completed college and
was eligible to apply had about 1 chance in 4 of being accepted.
Referring back to the data in Table 10.2, we see that on those apti-
tudes most closely related to academic success Walter is about 10
percentile points below the average of the group of students who
went from high school into premedical training. From those rather
limited data we would judge that even if he completed college,
Walter would have appreciably less than 1 chance in 4 of getting
into medical school, probably no more than 1 in 8 or 1 in 10. Com-
bining this with his probability of completing college, we might con-
clude that his chances are no better than 1 in 40 or 50 of reaching
medical school. This is perhaps as far as we should try to carry our
thinking in Walter's case. There are, of course, further questions
about whether he would manage to complete the medical school
course if he were admitted to it and of whether he would be suc-
cessful or happy as a doctor if he got his M.D. But these problems
are relatively remote in this case, compared with the intermediate
academic hurdles.
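
The combination of hurdles in this estimate is a product of conditional probabilities, which a brief Python sketch makes explicit. The figures of 1 chance in 5 of being graduated from college and 1 in 8 or 1 in 10 of then being admitted to medical school are the rough estimates from the text; the variable names are only illustrative.

```python
from fractions import Fraction

# Rough estimates taken from the text.
p_graduate_college = Fraction(1, 5)
p_admitted_if_graduated = [Fraction(1, 8), Fraction(1, 10)]

# Chance of clearing both hurdles = product of the successive chances.
p_reach_medical_school = [p_graduate_college * p for p in p_admitted_if_graduated]

for p in p_reach_medical_school:
    print(f"1 chance in {p.denominator}")  # 1 chance in 40, then 1 chance in 50
```

Each further hurdle, such as completing medical school, would multiply in another fraction and lower the overall figure again.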

Because of the limited data on which it is based, our estimate of
1 chance in 40 or 50 is a rough general estimate for boys falling at
an I.Q. about 110 to 115. It is an even rougher estimate for Walter
in particular, because so far it takes account only of aptitude meas-
ures and school achievement. It does not include any appraisal of
Walter's interests, the strength of his motivation, or the degree of
both financial and motivational support his aspirations will receive
from his family. All these factors, and others as well, will modify
the probabilities in Walter's individual case. However, some such
estimate is always implicit in the work of the counselor when he
tries to assess the significance of a set of test scores. We have tried
here to make the process explicit and show the thinking by which
the final estimate was reached. Whatever the exact value arrived
at, we would have to agree that as far as present evidence is con-
cerned Walter's chances of reaching his announced goal do not ap-
pear very good.


INTEREST MEASURES IN RELATION TO VOCATIONAL GOALS 
So far we have paid no attention to the interest scores. This is
partly because they relate less directly to success in the academic
training that is prerequisite for Walter's objective; correlations of
interest scores with academic achievement in general or in specific
areas are generally quite low. In part we have avoided bringing
interest into the picture because the manner in which interest pat-
terns are related to job success and job satisfaction is far from clear.
As we indicated in Chapter 13, practically all the information on
interest patterns of particular occupational groups is based on per- 
sons already in the occupation. Furthermore, it is not clear how 
close to the typical member of an occupation a person needs to be 
in order to be happy or successful in the occupation. In many 
occupations, such as engineering, there are wide variations in specific 
jobs within the occupation. Some may involve much social contact 
work and some little; some may involve primarily outdoor work and 
some indoor, and so on. Thus, there may be a place for individuals 
with quite different interest patterns within a single occupation. 
Closeness of correspondence with the typical may be pointed out to 
the counselee, but we would hesitate to counsel avoidance of an 
occupation solely because his interest pattern departs from what is 
typical of the occupation. 

In Walter's case, we may compare his percentiles on the Kuder
with the average percentiles for physicians and surgeons as a group.
The comparison is shown below.

                             Physicians
                  Walter     and Surgeons
Outdoor             80           60
Mechanical          75           37
Computational       46           32
Scientific          62           79
Persuasive          52           26
Literary            30           62
Artistic            32           61
Musical             45           58
Social Service      48           60
Clerical            40           27


Clearly, there are some appreciable discrepancies—discrepancies of 
30 percentile points or more. On the interest dimension (scientific) 
that is highest for the physicians and surgeons and therefore pre- 
sumably most critical for the occupation, Walter falls 17 percentile 
points below the group average. However, discrepancies of this size 
would undoubtedly appear in many instances if we were to compare 
individual physicians with the average of all physicians. We cannot 
say how often. Thus, though Walter's Kuder profile does not par-
ticularly suggest the interest patterns of a doctor, we have no real
basis for concluding that his interests are incompatible with that
field of work. If we rank the ten interest areas in order for Walter 
and for the physician group and correlate the two sets of ranks, we 
find that the correlation is about —0.1. Thus, Walter's order of 
preference is neither like nor particularly unlike that for physicians 
and surgeons. 
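
The rank correlation described here can be reproduced with a short Python sketch. The percentiles are those in the comparison of Walter with physicians and surgeons; giving tied percentiles the average of the ranks they span, and computing the coefficient as a product-moment correlation of the ranks, are assumptions of this sketch, since the text does not say how the tie between Outdoor and Social Service (both 60) was handled.

```python
from math import sqrt

AREAS = ["Outdoor", "Mechanical", "Computational", "Scientific", "Persuasive",
         "Literary", "Artistic", "Musical", "Social Service", "Clerical"]
WALTER = [80, 75, 46, 62, 52, 30, 32, 45, 48, 40]
PHYSICIANS = [60, 37, 32, 79, 26, 62, 61, 58, 60, 27]


def average_ranks(values):
    """Rank from highest (1) to lowest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def correlation(x, y):
    """Product-moment correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rho = correlation(average_ranks(WALTER), average_ranks(PHYSICIANS))
print(round(rho, 2))  # -0.12
```

The result of about -0.12 is consistent with the value of about -0.1 given in the text.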

We have now done about as much as we can in terms of the evi-
dence before us to assess the realism and suitability of Walter's ex-
pressed goal. Our best judgment would indicate that Walter's
abilities are quite low for the occupation he has chosen, giving him
only a rather small chance of achieving his expressed goal, and that
there is nothing in his interest pattern to affirm strongly its suit-
ability. At this point, obviously, we need to get better acquainted
with Walter as a person. (Often, this getting acquainted would have
preceded testing, but it is assumed that in this instance the test
data were gathered as part of a routine group testing program.)
This getting acquainted will depend in part upon the other types of
information about Walter that are already a matter of record; in
part upon conference with Walter.

Let us assume that the evidence that we gather does not essentially
change the picture of abilities or interests that are provided
by Walter's test scores. It illuminates the pattern of Walter's
motivation by bringing out a strong family commitment to a professional
career for Walter and a somewhat unanalyzed but long-standing
conviction on Walter's part that he wanted to "help people keep
well and strong." What are we to do now?


COMMUNICATING TEST RESULTS 


It would be rather generally accepted in present-day counseling
that the important goals of the counseling process are that the
counselee shall understand himself, accept himself, and arrive at a
program of action to which he himself is committed in the light of
the evidence. If these objectives are to be achieved, the general
implications of the test results must be communicated to the client.

There are two keys to this statement. One is "communicated to
the client"; the other is "general implications." Let us consider
each of these a little further.


THE MEANING OF COMMUNICATION 


What we communicate to someone else must be distinguished from
what we say to him. A teacher may say to a child, "You're a dumb
bunny." What she communicates is very likely: "I don't like you."
The message sent and the message received are quite different in 
this case and in many others. Our problem, in working with a client, 
is not simply one of stating in an accurate and objective manner 
what the tests show. It is one of having the client comprehend and 
accept a particular picture of himself, one that may be quite a bit 
different from the picture he has held heretofore. 

Really communicating with a client, really getting him to accept 
the implications of the test results and incorporate them into his 
picture of himself, is far from easy. This is true particularly when 
the change in his self-picture involves adopting a less flattering view 
of himself. In some cases, as in that of Walter, the implications 
may be quite traumatic. Communicating with parents who have
made firm commitments for their child's future, and who may be
satisfying their own needs vicariously through their child, is often
even more difficult.

There have been a number of attempts to study experimentally
different ways of presenting test data. However, the findings do not
point out some one procedure as particularly effective. We can at
best suggest a few guiding principles.

1. Change in the self-picture should be thought of as a gradual
and continuing process. Presentation of test implications may be
more effective if it is a continuing process, influencing all the
counselor's contacts with the client, rather than a single dramatic
event.

2. Test results gain meaning and significance in relation to other
life experiences. The attempt should be made to relate the test
findings to other experiences in and out of school. Where test results
and other experiences, i.e., of academic or work success, are
congruent, they serve to reinforce and give meaning to each other.
Thus, in our instance, Walter's aptitude measures and school standing
are in essential agreement, and each reinforces the other as an
appraisal of aptitude for college work. Where test results and other
types of evidence are at variance, a search with the counselee for the
reasons for the discrepancy may provide a deeper basis for
self-understanding.

3. The individual should take an active role in relating the test
results to himself and his plans. We have succeeded in communicating
only what the client himself sees. One way of assuring and
testing that communication is to have him participate in interpreting
the findings. This does not mean that he can be expected to
work out the technical significance of a test score. This is a job for
the counselor. Rather, once the technical interpretation has been
made he should participate in relating the results to his own
problems or plans.


THE OBJECTIVES OF COMMUNICATION 

In our communication, what we wish to convey to the client are
the general implications of the test results. It is usually neither
necessary nor particularly desirable to report to the client his exact
scores or exact standing on tests. Reporting exact values may have
several undesirable effects. The report is likely to convey an
impression of precision not at all justified by the basic data. To tell a
child's parents that his I.Q. is 117 or to tell an adolescent that he
falls at the 78th percentile on a test conveys an atmosphere of
exactness and finality quite inappropriate for our educational and
psychological measuring instruments. It ignores the standard error of the
score. As we have emphasized repeatedly, any test score must be
thought of as identifying a fairly broad range within which the
individual's true ability lies. This concept is very difficult to convey to
parent or child if an exact score value is reported.

The concept of range can best be incorporated in the manner in
which the test results are presented and interpreted to the client.
Thus, an I.Q. of 107 (or one of 96) for an elementary-school child
in a typical school might be reported as "about like most children
in ability to do school work." An I.Q. of 120 in this context could
be reported as "can be expected to learn school work somewhat
more easily than most." One of 85 might be expressed as "will
probably have more difficulty than many children in doing school
tasks." These phrasings are only suggestive. The point is that our
report is expressed (1) in broad and rather general terms, (2) in
terms of its practical implications, and (3) somewhat tentatively.

When working with tests for which the norms are given in
percentiles, we have to be particularly careful about our interpretations.
This is due to the unequal units of the percentile scale. The large
middle range of percentile values does not represent any very great
spread in level of performance on most tests, and we must be careful
not to overinterpret percentile differences occurring in this range.
Anything from the 25th to the 75th percentile should probably be
thought of as "about average," and reported as such. Thus, in
describing Walter's test results to him, we might say, "Your aptitude
test scores show abilities on these tests somewhat better than the
average boy in the tenth grade. You did about as well on one test
as on another, though there is a suggestion that your clerical speed
is a little below the other abilities. Your scores on tests related to
college were somewhat lower than the average of boys who go on
into college in a premedical program. Your areas of highest interest
were outdoors and mechanical." This is probably as much of an
interpretation as this set of test data warrants, though other aspects
of the tests might merit further discussion in relation to specific
educational or vocational plans.
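
The point about unequal percentile units can be made concrete with a small computation. The sketch below assumes, purely for illustration, that scores on the test are normally distributed, and converts percentiles into standard-deviation units. The "about average" band for the 25th to 75th percentile follows the suggestion above; the labels used outside that band are our own assumption.

```python
from statistics import NormalDist


def percentile_to_z(percentile):
    """Standard-score equivalent of a percentile, assuming a normal distribution."""
    return NormalDist().inv_cdf(percentile / 100)


def describe(percentile):
    """Broad verbal report; only the 25-75 band comes from the text."""
    if 25 <= percentile <= 75:
        return "about average"
    # Labels outside the middle band are illustrative assumptions.
    return "above average" if percentile > 75 else "below average"


# Ten percentile points near the middle versus ten points near the top.
gap_middle = percentile_to_z(60) - percentile_to_z(50)
gap_upper = percentile_to_z(99) - percentile_to_z(89)
print(round(gap_middle, 2), round(gap_upper, 2))  # 0.25 1.1
print(describe(46), "|", describe(80))
```

Under this assumption, ten percentile points in the middle of the scale cover about a quarter of a standard deviation, while the same ten points near the top cover more than a full standard deviation.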

Related to the problem of conveying a false impression of
exactness is that of overemphasizing and overvaluing the test results in
the client's mind. This danger is seen perhaps most dramatically in
the case of an I.Q. reported to a parent. In some degree most parents
live vicariously through their children. Some compete through their
children. They know enough about an I.Q. to recognize it as a
mark of status. You can get ahead of Mrs. Jones next door by
having a brighter child in much the same way that you can by
having a more expensive car or a new fur coat. Conversely, a low
I.Q. may be a basis for self-reproach or for rejecting the child. These
are unworthy, even vicious, uses of test results. They are fostered
by meager understanding of tests and by personal involvement.
This type of misuse of test results is another reason why the
counselor usually prefers to report test findings only in general terms.


TESTS IN THE IDENTIFICATION OF 
VOCATIONAL OBJECTIVES 


In the case of Walter, as we have just been describing it, we had
to deal with a boy who had expressed a definite educational and
vocational goal. Our initial problem was to try to assess the
prospects that his goal could be achieved. Clarification of Walter's
plans and objectives would have to include communicating to him
an estimate of the plausibility of his expressed objective. But
suppose that a counselee comes in who expresses no definite objective,
or suppose Walter wishes to consider other possible objectives.
What then? Let us organize our consideration of this situation
around another case.

Consider this second case, that of Henry White, who is a
classmate of Walter's. Henry's father works as a railroad conductor.
Henry has stated that he does not know what he wants to do when
he grows up. His test percentiles are as follows:


                                      Kuder Preference
Differential Aptitude Tests           Record, Vocational

Verbal Reasoning               36     Scientific        46
Numerical Ability              54     Outdoor           37
Abstract Reasoning             42     Computational     45
Space Relations                56     Clerical          78
Mechanical Reasoning           60     Literary          26
Clerical Speed and Accuracy    38     Artistic          48
Language Usage: Spelling       22     Musical           81
Language Usage: Sentences      31     Persuasive        56
                                      Mechanical        68
                                      Social Service    82


In academic work, Henry is about 135th in the class of 200.

This case presents us with quite a different situation from the one
we faced with Walter Kay. There our problem initially was to
appraise the appropriateness of a stated objective. In Henry's case,
no objective is expressed. Our problem is to see whether certain
areas of educational or vocational activity appear particularly
indicated by the test evidence and, if so, to help Henry to get better
acquainted with these possibilities.

Henry, who is below average both on the intellectual aspects of
the aptitude battery and in scholarship, does not seem a promising
candidate for college education. The probabilities are that he does
not aspire to it. If he does, counseling should be directed at a
thorough review of those plans in the light of the evidence on aptitude
and achievement. Any education planned beyond high school should
probably be in a type of institution and a type of program making
only modest intellectual demands.

Positive guidance of Henry in relation to immediate educational
choices would depend to a considerable extent upon sharpening up
his vocational objectives. What can we say about Henry's
vocational prospects? What type of vocational objective is suitable for
him in the light of his test scores? What steps should a counselor
take to help Henry set up suitable and more definite vocational
goals? When these steps have been taken, the resulting plan may
provide guides as to the high-school program that would be
desirable, e.g., clerical, vocational, prebusiness, or general.


LIMITATIONS OF APTITUDE MEASURES FOR IDENTIFYING SUITABLE
VOCATIONAL GOALS

If we limit ourselves for the moment to the DAT subtests, we
must admit that they provide only limited help in establishing a
vocational goal for Henry and that what they provide is largely
negative. If we exclude collegiate education, we exclude those jobs
that depend upon college or professional education. We do not
need to consider law, medicine, engineering, architecture, or similar
professional occupations. The low verbal, spelling, and sentences
scores may also steer us away from some non-collegiate jobs with a
heavy linguistic loading, i.e., stenographer. However, that still
leaves us perhaps 90 per cent of the world of work to choose from.
Should Henry think of becoming a mechanic? A farmer? A
salesman? A conductor like his father? Any of these and many others
appear quite possible in the light of his aptitude profile.

Why can we offer no more specific positive guidance on the basis 
of Henry's abilities? Basically, four considerations enter in. 


1. Profile Not Sharply Differentiated. In Henry's case there isn't
too much difference in his different abilities. His scores on this
testing range from the 22nd percentile (Spelling) to the 60th
(Mechanical), but in this middle percentile range the error of
measurement is such that his relative standing on any pair of tests could
quite possibly be reversed if he were retested with another form of
the tests. Even assuming that the obtained scores are
approximately correct, most of the differences are not large enough to have
great practical significance. We may feel that the lower scores on
the tests of verbal comprehension and language usage have some
significance for vocational planning, but beyond this there is not
much to say.

Many people will show test profiles of this general type. Their
abilities are all at about the same level. They have no outstanding
strengths and weaknesses. Their test scores provide a general
indication of level of ability but limited cues as to specific strengths and
weaknesses. With no special strengths or weaknesses, they appear
about equally likely to succeed in many different types of jobs.

2. Lack of Unique Relationship of Ability Profile to Job Success.
Even if a person is higher on one or two abilities
than on others, this does not mean that there are one or two specific
jobs in which he will be uniquely successful. The boy whose specially
high point is mechanical comprehension may do well as an
automotive mechanic, but he may also do well as a telephone repairman or
as a farmer. The person high on numerical ability may be successful
as a bookkeeper or as a surveyor. For the typical individual there are
at least several superficially rather different jobs for which his ability
pattern is equally suitable and many others for which his talents
are adequate. In terms of their aptitude requirements, jobs come in
families, and one family merges gradually into the next.
Fitness for different jobs is a matter of degree, and any person is
well suited to a number of jobs. Positive guidance
can be only in terms of broad segments of the world of
work. There is no one job for which each person is best fitted.

3. Limited Knowledge of the Ability Requirements of Jobs. Over
and above the limitations arising out of overlap in the true ability
requirements of jobs, there are limitations stemming from our own
ignorance. Our knowledge of what abilities are required by what
jobs is still quite limited. We are not in a position to state with
confidence what abilities a high-school student should display if he
is to become a successful plumber, shoe salesman, truck driver, or
service-station operator. We do not know what people who have
been successful and contented in these jobs were like when they
entered the world of work or what they are like now. We do not
know what types of people have tried to work in these fields and
failed. We have a good deal of information about the abilities that
are required to succeed in advanced training but only scattered
information about what is required to succeed in a job.

A fair part of the information that does exist is not readily
available. Part of it has been done for specific private
companies and has not been made public, either through specific
policy or because of pressure of other activities. Part of it has been
assembled by such government agencies as the U. S. Employment
Service, and serves primarily the functions of the gathering agency.

4. Techniques to Measure Certain Abilities Are Inadequate. We
have sound and practical tools of measurement for only
part of the range of human abilities. One most impressive gap lies
in the area of social skills and techniques. Abilities to understand
and react sagaciously to problems involving other people appear to
be important in many sales, contact, and managerial jobs. We
have no tests of demonstrated validity in this area. Skill in
practical problem-solving, not solving puzzles of a verbal and academic
sort, is a related area in which we do not measure very well. Other
important gaps also exist. Our inability to make sound and
distinctive suggestions about job possibilities stems in part from our
inability to appraise important distinguishing aptitudes.

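
The first consideration above, that the relative standing on a pair of tests might be reversed on retesting, can be examined with the standard error of a difference between two scores. The sketch below is illustrative only. It assumes normally distributed scores, expresses each percentile as a standard score, and uses the usual formula for the standard error of a difference, SD times the square root of (2 - r_aa - r_bb). The reliability of 0.85 is a hypothetical figure chosen for the illustration, not a value taken from the DAT manual, and the result depends heavily on that assumption.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical reliability, assumed equal for both tests in the pair.
ASSUMED_RELIABILITY = 0.85


def se_of_difference(rel_a, rel_b):
    """Standard error of a difference between two standard scores (SD = 1)."""
    return sqrt(2 - rel_a - rel_b)


def difference_in_se_units(pct_a, pct_b, rel_a, rel_b):
    """Observed difference between two percentiles, in standard-error units."""
    z = NormalDist().inv_cdf  # normality is assumed for the conversion
    observed = abs(z(pct_a / 100) - z(pct_b / 100))
    return observed / se_of_difference(rel_a, rel_b)


# Henry's Numerical Ability (54th) versus Abstract Reasoning (42nd).
ratio = difference_in_se_units(54, 42, ASSUMED_RELIABILITY, ASSUMED_RELIABILITY)
print(round(ratio, 2))  # well under 1: a difference this small could easily reverse
```

A ratio below 1 means the observed difference is smaller than its own standard error, which is the situation described for most pairs of Henry's scores.
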
ROLE OF INTEREST MEASURES IN IDENTIFYING VOCATIONAL OBJECTIVES 

So far we have not discussed the use of the interest inventory
results in Henry's case. When the problem is to explore areas of work
and focalize vocational objectives, the interest measures may be of
as much or more value than ability measures. They should not be
interpreted rigidly or taken at face value, but they do provide a 
starting point for discussion. Thus, the counselor could explore with 
Henry his apparent interest in clerical types of activities. If the 
test score was supported in discussions, the next step might be to 
provide Henry with a chance to become acquainted with possible 
clerical types of work in his community, either through reading or 
hearing about the jobs or through vacation or part-time employ- 
ment. 


Summarizing, in view of all the above it appears clear that guid- 
ance with relation to vocational plans will in most instances have to 
be couched in rather general terms. Guidance with regard to gen- 
eral educational level will be quite possible. 
Some indications of broad areas and types of jobs will often
be appropriate. Beyond this, there is much free space
within which interest, local opportunity, exploratory tryout, and
individual idiosyncrasy may freely operate.

RESOURCES FOR JUDGING THE EDUCATIONAL AND 
VOCATIONAL SIGNIFICANCE OF TEST SCORES


Before he can provide constructive guidance to a client's thinking
about educational or vocational plans, a counselor must have a clear
picture in his own mind of the educational and vocational
implications of the counselee's test scores. He must have an estimate of
the probability of realizing a stated objective or a picture of
objectives that are appropriate. To what sources may the counselor
turn to build up his skills of evaluating test patterns in a sound and
realistic manner? Where can he find help in translating a set of
test scores into a prediction of probable success?

Ready-made tables showing the chances for success in any given
job at different test score levels are largely non-existent at the
present time. Perhaps they never can exist. For now, the counselor
must be content with much more modest aids. We will consider a
few of these.

REVISED MINNESOTA OCCUPATIONAL RATING SCALES

This little monograph (4) provides ratings of some 400 occupations
with respect to the minimum level of (1) academic ability, (2)
mechanical ability, (3) social intelligence, (4) clerical ability, (5)
musical ability, (6) artistic ability, and (7) physical agility required to
succeed in the occupation. Ratings are given in four levels, which
are generally to be interpreted as follows:

A: above the 90th percentile of the adult population
B: 76th to 90th percentile of the adult population
C: 26th to 75th percentile of the adult population
D: 1st to 25th percentile of the adult population

The exception is the scales for musical and artistic ability, in which
A is defined as the 97th percentile or higher, B as the 90th to 96th
percentile, and C as the 26th to 90th percentile. The ratings are
frankly a synthesis of judgments. However, they are the judgments
of highly trained individuals whose professional careers have
centered around the study of jobs and the requirements of jobs.
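
The level definitions above amount to a simple lookup from percentile to letter rating. The following sketch shows one way a counselor's clerk might mechanize that lookup. The function names are hypothetical. Because the stated B and C ranges for the musical and artistic scales overlap at the 90th percentile, the sketch assumes that the 90th percentile is rated B.

```python
LEVEL_ORDER = "DCBA"  # lowest to highest


def rating(percentile, scale="general"):
    """Letter level for an adult-population percentile on one rating scale."""
    if scale in ("musical", "artistic"):
        if percentile >= 97:
            return "A"
        if percentile >= 90:  # assumption: the 90th percentile counts as B
            return "B"
        return "C" if percentile >= 26 else "D"
    if percentile > 90:
        return "A"
    if percentile >= 76:
        return "B"
    return "C" if percentile >= 26 else "D"


def meets_minimum(person_level, required_level):
    """True when the person's level is at or above the job's minimum level."""
    return LEVEL_ORDER.index(person_level) >= LEVEL_ORDER.index(required_level)


print(rating(95), rating(80), rating(50), rating(10))  # A B C D
print(rating(92, "musical"))                          # B
print(meets_minimum("C", "A"))                        # False
```

Such a lookup only restates the published definitions; the judgment of whether a discrepancy matters remains with the counselor.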

We might consider our two illustrative cases to see what guidance 
the scales can give us. First, considering Walter with his aspiration 
to be a doctor, we have the following picture: 


              Physicians    Walter
Academic          A           C+
Mechanical        I           B-
Social            B           ?
Clerical          C           C
Musical           D           ?
Artistic          D           ?
Physical          B           ?


Clearly, the discrepancy we have noted between Walter's academic
aptitude and the demands of the job is confirmed by these ratings.
We would need information from other sources about Walter's social
ability and physical agility to judge whether these also represent
points of discrepancy.

Let us now suppose that Henry is considering trying to get a
clerical job with the railroad that employs his father. He hopes that
he might work as a ticket agent or possibly as station agent in a
small town. The evidence looks like this:


              Station Agent    Henry
Academic           C             C
Mechanical         б             C
Social             C             ?
Clerical           B             C
Musical            D             ?
Artistic           C             ?
Physical           C             ?


We would have to judge whether the discrepancy between the evidence
we have on Henry's clerical skill and the demands of the job
is a sufficient basis for discouraging this particular goal. Further
evidence should be sought on Henry's clerical skills.


TEST MANUALS 


Some test manuals provide information on the scores of particular
occupational groups. The report may include no more than the
median score for specific groups. We saw such figures for the
Differential Aptitude Tests in Table 10.2. Sometimes rather complete
norms may be provided for individuals in particular occupations.
Thus, the Minnesota Vocational Test for Clerical Workers provides
percentile norms for the following groups of workers in the clerical
field:

Women: Office machine operators.
       Stenographers and typists.
       General clerical workers.
       Routine clerical workers.

Men:   Bank tellers.
       Accountants and bookkeepers.
       General clerical workers.
       Routine clerical workers.
       Shipping and stock clerks.


On the average, members of these occupational groups score
substantially higher on this test than do workers in general. A score
set at perhaps the 10th percentile of persons employed in one of
these occupations might constitute a warning level in the guidance
of aspiring students.

A test manual that provides extensive standards on the
requirements of jobs is the manual for the General Aptitude
Test Battery (GATB) of the U. S. Employment Service. This manual
identifies twenty "occupational aptitude patterns" and proposes
minimum aptitude standards for each. Thus, Pattern 1 is defined
by minimum scores of 130 * on G (general intelligence) and 130 on
V (verbal ability). It includes occupations involving literary work,
creative writing and translating, copy writing, and journalism.
Nineteen other patterns are given, requiring different aptitude
combinations and different minimum levels, and many specific jobs are
assigned to each.

* Standard scores with mean of 100 and S.D. of 20.
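
To see how demanding a minimum of 130 is on a scale with a mean of 100 and a standard deviation of 20, the standard score can be converted to an approximate percentile. The sketch below assumes a normal distribution of scores, which is our assumption and not a statement from the GATB manual. The dictionary and function names are hypothetical, and only the Pattern 1 minimums quoted above are used.

```python
from statistics import NormalDist

# Standard-score scale described in the footnote: mean 100, S.D. 20.
GATB_SCALE = NormalDist(mu=100, sigma=20)

# Minimums for Pattern 1 as quoted in the text.
PATTERN_1_MINIMUMS = {"G": 130, "V": 130}


def percentile_of(score):
    """Approximate percentile of a standard score, assuming normality."""
    return 100 * GATB_SCALE.cdf(score)


def meets_pattern(scores, minimums):
    """True only if every aptitude named in the pattern reaches its minimum."""
    return all(scores.get(aptitude, 0) >= floor for aptitude, floor in minimums.items())


print(round(percentile_of(130), 1))                           # 93.3
print(meets_pattern({"G": 135, "V": 128}, PATTERN_1_MINIMUMS))  # False
```

Under the normality assumption, a minimum of 130 admits only about the top 7 per cent on each aptitude, and a candidate must clear every minimum in the pattern.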


For many of the jobs, the evidence upon which the determination
of minimum scores and the assignment of jobs to patterns is based 
is rather limited, and the evidence is open to some criticism on tech- 
nical grounds. However, the GATB data available in the records 
at the U. S. Employment Service represent one of the major pools 
of data on the relation of tests to job success. It is unfortunate 
that this information is not generally available to school counselors 


in freely distributed published sources. 


DATA ON ARMY GENERAL CLASSIFICATION TEST 

Information on the level of general intellectual ability of indi- 
viduals in different occupations is provided by World War II Army 
General Classification Test data. The data for selected occupations 
are shown in Table 9.3. A much more complete table, covering 
many more occupations, may be found in the original journal article (5).
This table shows the general intellectual level of young men who had
entered different occupations. However, it gives no indication of
requirements for more specialized abilities.


SUMMARY STATEMENT 


The process of using tests in educational and vocational counseling
involves two main steps. First, the counselor must himself
arrive at a sound interpretation of the significance of the test data.
Secondly, he must communicate that interpretation to the counselee
in such a way that the counselee's self-picture and plans come to
correspond better with the realities of his abilities and interests.

Two somewhat different situations arise in the interpreting of test
results. In one, the client expresses definite educational or
vocational goals. The counselor must interpret the test results in
relation to those goals. He must arrive at some judgment as to the
likelihood that the goals can be attained and the probability that
they will prove acceptable to the client if they are attained. The
evidence by which the counselor is to reach this judgment is
fragmentary and scattered. Some sources that may help him are
suggested in the previous section.

In the second case, the client's goals are vague and undefined.
The counselor must direct the counselee's attention to areas that
look promising in terms of the test results. For most clients this
type of guidance can be only in the broadest of terms because of the
wide overlap in the abilities that different jobs require. Guidance
as to general level of educational and vocational aspiration seems
plausible, and counselees can be steered away from plans calling for
abilities they lack. Positive suggestions should, however, be
expressed in quite general and tentative terms.

The process of communicating test results also raises certain prob- 
lems. If the test results are to be helpful, they must be presented 
in a way that makes it possible for the client to accept them. This 
is particularly difficult when the tests are less flattering than the
client's previous self-picture. 

Test results should usually be presented in rather general terms 
and with emphasis upon the interpretation and significance of the 
results. The interpretation should help to avoid overemphasis on 
specific test scores and should at the same time help the client to 
relate the tests to the plans to be made and the action to be taken. 


REFERENCES 


1. Adams, F. J., College degrees and elementary school intelligence
quotients, J. educ. Psychol., 1940, 31, 360-368.

2. Guthrie, W. S., Study of applications to medical schools, J. higher Educ.,
1949, 20, 100-101.

3. Keys, N., The value of group test I.Q.'s for prediction of progress
beyond high school, J. educ. Psychol., 1940, 31, 81-93.

4. Paterson, D. G., C. d'A. Gerken, and M. E. Hahn, Revised Minnesota
occupational rating scales, Minneapolis, Minn., University of Minnesota Press,
1953.

5. Stewart, Naomi, A.G.C.T. scores of army personnel grouped by
occupations, Occupations, 1947, 26, 5-41.

SUGGESTED ADDITIONAL READING 


Bennett, George K., H. G. Seashore, and A. G. Wesman, Counseling from
profiles: a casebook for the Differential Aptitude Tests, New York, Psychological
Corp., 1951.

Darley, John G., and Gordon V. Anderson, The functions of measurement
in counseling, in E. F. Lindquist, Editor, Educational measurement,
Washington, D. C., American Council on Education, 1951, Chapter 3.

Froehlich, Clifford P., and John G. Darley, Studying students, guidance
methods for individual analysis, Chicago, Science Research Associates, 1952.

Rothney, John W. M., and Bert A. Roens, Guidance of American youth;
an experimental study, Cambridge, Mass., Harvard University Press, 1950.

Super, Donald E., Appraising vocational fitness, New York, Harper, 1949,
Chapters 19, 23.


QUESTIONS FOR DISCUSSION 


1. If you were a counselor, how would you use an expectancy table, such 
as that given in Table 18.1? How should a guidance worker in high school 
use the data of Table 9.3? 

2. How specific is the vocational guidance that can be given on the basis 
of scores from a battery of ability and interest tests? What other factors 
would need to be taken into account in helping a student to formulate voca- 
tional plans? 

3. What obstacles to communication is a counselor likely to encounter? 
What steps can be taken to overcome these? 

4. In reporting test results to a counselee, how specific should a vocational 
counselor be? 

5. What sort of validity data about an aptitude test would be most useful 
to a vocational counselor in giving guidance to a student? How should this 
information be organized and presented to the counselor for his use? 

6. The son of the local banker got the following scores on the Differential 
Aptitude Test Battery and the Kuder Preference Record administered in the 
eleventh grade. What tentative plans seem suitable in the light of the test 
scores? How definite should these plans be at the present time? What
further information should be sought?

DAT Subtest                  Percentile    Kuder Scale      Percentile

Verbal Reasoning                 95        Scientific           68
Numerical Ability                70        Outdoor              82
Abstract Reasoning               90        Computational        36
Space Relations                  80        Clerical             43
Mechanical Reasoning             85        Literary             72
Clerical Speed and Accuracy      60        Artistic             45
Language Usage: Spelling         90        Musical              38
Language Usage: Sentences        95        Persuasive           67
                                           Mechanical           35
                                           Social Service       18


7. In the same eleventh grade the son of a truck driver had the following
scores. What would be the objectives of vocational counseling in his case?

DAT Subtest Percentile Kuder Scale Percentile
Verbal Reasoning 12 Scientific 26
Numerical Ability 3 Outdoor 88
Abstract Reasoning 14 Computational 25 
Space Relations 28 Clerical 58 
Mechanical Reasoning 45 Literary 1 
Clerical Speed and Accuracy 40 A rtistic a 
Language Usage: Spelling ч t Es 
Language Usage: Sentences 1 E 


Social Service 55 


Chapter 19 


Tests in the Selection 
of Personnel 


One major function that tests have come to serve in the United
States is that of screening applicants for a training program or a
job. Colleges and professional schools use standardized measures of
achievement and aptitude as at least partial bases for deciding which
applicants to admit. Vast numbers of civil service positions are
filled on the basis of competitive examination. Industry selects men
for many jobs in terms of their performance on tests of relevant
abilities. Such a selection program has as its objective maximizing
the average level of achievement of those who are accepted for the
training and the job and minimizing the occurrence of failures.

The school or employer that proposes to introduce a program of
selection testing faces a number of issues. What is the best test to
use, in terms of effectiveness versus cost? Should a single test be
used or should it be supplemented by others? If more than one test
is to be used, how should the tests be combined in order to produce
the most efficient team? How should test results be combined with
non-test data about the individual?


STEPS IN SETTING UP A SELECTION PROGRAM 


The basic pattern of selection research is simple and
straightforward. You decide what types of measures are promising as
predictors of success in the training program or job with which you
are concerned. You make a judgment as to what can best be
accepted as an index of success in the training program or job. You
buy or make the tests and administer them to a group of applicants.
You get measures of success for these same applicants after they
have had a period of experience in the training program or job.
You determine the relationship of each predictor to the criterion
measure of success. You pick the best predictor or predictors and
use them to screen future applicants.




That is the basic pattern. There are, however, a number of issues 
that arise at each step in the proceedings. Some involve complex 
statistical problems that we cannot go into here. However, we shall 
consider the operations step by step and try to anticipate some of 
the recurring problems. 


PICKING TRYOUT TESTS FOR SELECTION RESEARCH 

How shall we decide what sorts of tests to try out as predictors of 
success in a given training program or job? Of course, we may have 
some hunches based on our familiarity with the school or job. The 
very fact that it is a school, for example, suggests that some type of 
scholastic aptitude test would be appropriate. But if we are to refine 
our crude original hunches, we can do it only by studying the school 
program or job duties. We carry out what has been called a job 
analysis. The term job analysis is a somewhat ambiguous one. It
covers assorted techniques of studying jobs for one or more of a
variety of purposes. The purpose may be to determine salary sched-
ules, to improve safety procedures, to develop training programs,
or to define ladders of promotion. It may also be to describe the
tasks done on the job and to estimate what abilities and personal
qualities are required to do those tasks well. The job analysis with
which we are concerned focuses on this last type of information.

There is no special magic technique for job analysis. The analyst
operates by going out and observing people working at the job, by
talking to them about what they do, by examining the tools they
have to use and the textbooks or instruction manuals they have to
read, and by observing the conditions under which they have to
work. His problem is to get a complete description of the job. From
this and from his background of knowledge of human abilities, he
organizes his hypotheses as to the abilities that are important for
the job. These statements of job requirements are refined guesses
based on scrutiny of the job.

Given a set of educated guesses as to the abilities important for
a job, the next step is to translate them into actual test procedures.
Often, the practical step will be to try some of the ready-made
tests that appear to measure abilities much like those suggested by
the job analysis. The sources suggested in Chapter 8 will identify
available instruments and provide information about them.

In other cases, there may be no existing test that appears to fill
the bill. It is then necessary to try to invent test tasks that will
tap the functions whose importance was indicated by the job analysis.
Test specifications must be prepared and tasks or items constructed.




Many of the guide lines set down in Chapters 3 and 4 will apply, 
though tests of special aptitudes may differ quite a bit from tests 
of school achievement. Usually, it will be desirable to try out any 
new test on preliminary groups. Answers will be needed for such 
questions as the following: 


1. Are the instructions sufficiently clear and detailed and are there 
enough practice items? 

2. Are the time limits appropriate? If this is a speed test, are 
there enough materials to keep everybody busy? If the test empha- 
sizes primarily power, are the limits long enough so everyone has a 
chance to try most of the items? 

3. Are the separate test items satisfactory? Administration of the 
test to a preliminary group and analysis of the responses to each 
item by high- and low-scoring individuals is usually desirable as a 
means of eliminating items that fail to discriminate or that are too 
easy or too hard. 

4. Does the test measure with at least moderate reliability? If 
reliability is very low, steps to improve it by lengthening the test 
or by revising and selecting items will usually be indicated. 
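
The item analysis mentioned under question 3 can be sketched in a few lines of Python. The sketch assumes that each answer is scored 1 (right) or 0 (wrong), splits the tryout group into upper and lower halves on total score, and reports for each item the proportion passing and the difference between the proportions passing in the two halves. The function name, the half-and-half split, and the data are illustrative assumptions, not a procedure prescribed in the text.

```python
def item_analysis(responses):
    """responses: list of per-examinee lists of 0/1 item scores."""
    n_items = len(responses[0])
    # Rank examinees by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    results = []
    for i in range(n_items):
        p_all = sum(r[i] for r in responses) / len(responses)
        p_upper = sum(r[i] for r in upper) / half
        p_lower = sum(r[i] for r in lower) / half
        results.append({"item": i + 1,
                        "difficulty": p_all,                   # proportion passing
                        "discrimination": p_upper - p_lower})  # upper minus lower
    return results

# Hypothetical tryout data: 6 examinees, 4 items.
tryout = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
]
for row in item_analysis(tryout):
    print(row)
```

In these made-up data the first item is passed by everyone and so cannot discriminate, and the fourth item is passed equally often by high and low scorers; both would be candidates for elimination or revision.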


IDENTIFYING A SUITABLE CRITERION MEASURE 

If we are to evaluate a test or some other type of predictor, we
must have something against which to evaluate it. When we are
dealing with success in college, professional school, or some type of
training program, marks in courses or grade-point average are usu-
ally ready at hand. We take them more or less for granted and use
them as our criterion measure of success. This is good enough as
far as it goes. We may have certain reservations about course grades
as a standard of success, but the judgments they represent are at
least a first approximation to the objectives of the training pro-
gram.

When our objective is to select persons for a job, as distinct from
a training program, the problem of a criterion measure becomes
much more troublesome. The novice in the field is likely to assume
that he need only look and he will find ready at hand some suitable
production record or fitness rating to tell him how good a worker
each employee is. The truth of the matter is that existing records
are rarely satisfactory and that better ones are hard to come by.
One function of a job analysis is to explore and evaluate existing
records that might serve as indices of job success and to look for
other better procedures that might be substituted for them. Possible



indicators of success that we might wish to use include (1) academic
or training school grades, (2) proficiency tests, (3) performance
records, and (4) ratings by supervisors or associates.

Academic Grades. Grade in a training program provides a fairly
simple and straightforward measure that is available with little
delay and is usually of satisfactory reliability. The sad thing is that
a summary of evidence to date indicates that there is often little
correspondence between the tests that have high validity for a train-
ing criterion and those that predict success on the job itself. The
tests of verbal, quantitative, and reasoning abilities that are good
predictors of ability to learn are not comparably good measures of
job performance. If the selection tests emphasize ability to absorb
the training, they are likely to be relatively inefficient in picking
persons who will later be judged to be good workers. Of course,
when successful completion of a training program is a prerequisite
for entry into the job, a certain type of administrative validity is
automatically given to this type of criterion.

Job Proficiency Tests. Suitable job proficiency tests are rarely
available, and the preparation of a test measuring job competence
represents quite an undertaking. In some jobs, accountancy for
example, many of the knowledges and skills of the job can be reduced
to test tasks. In others, such as selling, a test has quite limited
possibilities. Often proficiency tests will need to be performance
tests or performance checks. Thus, the competence of an airplane
mechanic may be evaluated with some success by having a skilled
mechanic observe the person being tested as he demonstrates various
key procedures in plane maintenance and repair. A proficiency
test can at best measure certain job knowledges and skills; it cannot
tell how effectively the individual will apply them at work on the
job. It may tell how well he can do certain tasks, but it cannot tell
how well he will do them.


Production Measures. The measure of job success to which one is
always likely to turn first when seeking a criterion measure is some
record of job output. The number of widgets produced per hour,
the monthly sales of gilhickeys, or the number of defects per hundred
whortlebugs seem like sound indices of the quality of a worker. In
some cases, performance records can be used to advantage. How-
ever, there are many jobs for which no simple performance measure
can be found. The receptionist, the bank teller, the department
supervisor, the plumber, and the teacher are doing jobs in which we
can hardly find any product to count or score. The product is too
varied or intangible to provide us with an acceptable criterion.




Even in those jobs in which there is some product to count or score,
there are many pitfalls in using the product as a criterion measure.
We may consider several briefly.


1. The product may depend upon other people. Thus, during
World War II attempts to use bombing records as a measure of the
competence of a bombardier broke down in part because where the
bomb fell was affected by the way the pilot flew the plane. Again,
the sales of an insurance agent may depend upon the type and
amount of supervision and help he gets from the agency manager.

2. Outside conditions may vary from person to person. The
quality of equipment may be important: new tools versus old, good
maintenance versus poor. Prosperity of the neighborhood may be
a factor in any measure of sales volume.

3. A sample of performance may be quite unreliable. Thus, ac-
curacy of bombing by a single student showed wide variations from
day to day. A limited sample of production or sales may be similarly
undependable.

4. The performance measure may represent only a limited aspect
of the job. Thus, for a life-insurance salesman the dollar volume of
sales may be less important than the permanence of the sales. There
is no profit in lapsed policies. It might be possible to keep a record
of the output of dictation and typing for each of a group of secre-
taries, but this would take no account of her dependability in re-
membering appointments or her diplomacy in answering the phone.


An actual record of performance is undoubtedly an attractive
candidate as a criterion of job proficiency. But performance records
need careful scrutiny in terms of such considerations as those men-
tioned above. If the measure holds up under scrutiny, as it some-
times will, it can be used as a standard against which predictor meas-
ures can be appraised. But if no production records are available
or if existing records are fallible for one reason or another, we shall
have to look elsewhere for our criterion measure.

Ratings of Job Proficiency. In actual practice, the selection re-
search worker is often thrown back upon ratings for lack of any
more satisfactory criterion measure. It is almost always possible
to arrange to get some type of rating, usually by instructor or super-
visor. Ratings are applicable to almost any type of job. The fact
that a rater can synthesize different aspects of achievement in one
judgment and can make allowances for special external factors that
may have favored or handicapped the worker is in some ways an
advantage. However, the limitations of ratings as evaluations of




competence are many. We have discussed them in detail in Chapter
13 and need not repeat that discussion here. It will suffice to say
that the reliability of criterion ratings is often low, and they are
frequently biased by factors not truly related to competence. Vari-
ous of the techniques of analytical check lists or forced-choice judg-
ments have been applied in an attempt to overcome these limitations.
These procedures have some promise but are quite laborious. It is
the fact that in personnel research obtaining good criteria of pro-
ficiency may often call for more skill and effort than developing pre-
dictor tests.


ADMINISTERING TESTS FOR VALIDATION 

Once tests have been selected for tryout and plans have been
made to collect as good criterion data as circumstances permit, the
tests should be administered to a group on a research basis. Ideally,
tests are given to persons before they start on the job or training
program. If the tests are given to individuals who are on the job
or who have already taken part of the training course, we cannot
say how much of any relationship we may find is due to actual ex-
perience in the job or training program. Thus, if we give a reading
test at the end of the freshman year in college and find that those
with high scholastic averages are good readers, we are never sure to
what extent they did well in college because they were good read-
ers to start with or to what extent they learned good reading skills
as they worked effectively on their college courses. The motivation
of applicants and of accepted students or workers may also be dif-
ferent, and this difference may distort the results.

It is always logically preferable to try out tests on a group that
has yet to start on the job or training program. However, this
procedure raises certain practical problems. Gathering data in this
way is a slower process. There is always a delay of months, even
years, while the persons tested are completing their training or get-
ting well enough established in the job so that we can get a reason-
able measure of their competence. Flow of new personnel into a job
may be so slow that a long time is required to accumulate a sufficient
sample of applicants. Reaching examinees for follow-up months or
years later may be difficult. For these practical reasons, tests are
sometimes tried out on groups already working in a job. But re-
sults for such a group must be considered tentative as evidence
on the predictive effectiveness of the tests.

One problem that is serious in validating tests in a job
context is that of getting groups of adequate size. The accuracy



with which relationships can be established depends upon the size 
of the sample upon which statistics are based. The precision of the 
correlation coefficient is illustrated in Table 19.1. Thus, if we have 


Table 19.1. Range Outside of Which Sample Correlation Coefficient Will
Fall 50 Per Cent of Time for Samples of Different Sizes and Different
Values of True Correlation

True Value                              Size of Sample
of Correlation   25               50            100                200                400
    .00          -.143 to +.143   [illegible]   -.068 to +.068     -.048 to +.048     -.034 to +.034
    .20          .058-.333        [illegible]   [illegible]-.264   .153-[illegible]   .167-.232
    .30          .164-.425        .208-.387     [illegible]-.362   .256-.343          [illegible]
    .40          .273-.514        [illegible]   .341-.456          .359-.440          .371-.428
    .50          .384-.600        [illegible]   .447-.550          .463-.536          .474-.525
    .60          .500-.684        .534-.65      .555-.642          .568-.630          .578-.621


a sample of only 25 cases and the true value of the relationship in
the total population of all cases is represented by a correlation of
0.20, we stand a 50-50 chance of getting a value that is either above
0.333 or below 0.058. The other entries in the table are to be inter-
preted in the same way.

Clearly, the larger the sample the more dependable our conclusions
will be with respect to which tests to select as predictors. How
large a sample do we need? This is the old question: How high is
up? The only answer we can give is: The more the better. But
there is probably a lower limit below which it doesn't pay to carry
out statistical analysis of tests as predictors. At some point, our
rational judgment based upon the nature of the tests and the nature
of the job is probably more dependable than the empirical results
from the small sample. We would judge that there is rarely any
profit in computing predictor-criterion correlations for groups of 25
or less and that the value of analyzing groups of under 100 is often
questionable. With samples as small as this, we can often put about
as much trust in our judgment as in our statistics.
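
The dependence of these limits on sample size can be explored with a short Python sketch. The text does not say how Table 19.1 was computed; the sketch assumes the familiar Fisher z approximation, in which the transformed correlation has a standard error of about 1/sqrt(n - 3) and the middle 50 per cent of samples lie within 0.6745 standard errors of the true value. The function name is ours. For a true correlation of 0.20 and 25 cases, this approximation gives limits that agree closely with the 0.058 and 0.333 quoted above.

```python
import math

def middle_half_range(true_r, n):
    """Approximate limits containing the middle 50% of sample correlations."""
    z = math.atanh(true_r)           # Fisher z transform of the true value
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    half_width = 0.6745 * se         # +/- 0.6745 SE spans the middle 50%
    return math.tanh(z - half_width), math.tanh(z + half_width)

for n in (25, 100, 400):
    low, high = middle_half_range(0.20, n)
    print(n, round(low, 3), round(high, 3))
```

Because the width of the range shrinks only with the square root of the sample size, quadrupling the number of cases roughly halves the uncertainty.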


STATISTICAL ANALYSIS OF SELECTION TEST DATA 


For the research worker analyzing several predictor tests in rela-
tion to a certain criterion of job success, the essential statistic is
the correlation of each predictor with the criterion. The higher
the correlation, the more effective is the predictor in identifying
those who will do well on the criterion measure. We shall illustrate
this point, and also a number of other issues that arise in using
predictor test data, with a small set of actual data.




In the course of some research on electronics personnel, the deci- 
sion was made to try out test materials dealing with (1) mathematics, 
(2) shop knowledge, and (3) electricity. The criterion measure in 
this case was a composite of grades received in an 8-month training 
program. Data are analyzed here for a sample of 99 students. Some
were eliminated for academic failure before completing training, and 
these were assigned grades below the lowest of those graduating. 
Some of those who graduated had grades so low that they were des- 
ignated as marginal. In all, 30 cases fell in this failed or marginally 
satisfactory group. The correlations of the three brief tests with 
the academic grades criterion were as follows: 

Mathematics 0.40 (20-item test) 
Shop 0.30 (10-item test) 
Electricity 0.58 (15-item test) 

The numbers of unsatisfactory (failed or marginal) and satisfac- 

tory students at each score level are shown in Table 19.2. To see 


Table 19.2. Number of Men Receiving Satisfactory and Unsatisfactory
Grades in Electronics Training at Each Score Level on Three Selection Tests


Mathematics Shop Electricity 
Score Unsatis. Satis. Unsatis. Satis. Unsatis. Satis. 
1 1 1 
2 1 2 
3 1 1 7 3 1 
4 2 2 9 1 
5 1 4 11 4 3 
6 1 1 7 13 4 2 
7 1 4 5 12 4 2 
8 1 4 2 10 10 9 
9 5 1 1 6 6 11 
10 4 2 1 11 
11 1 3 9 
12 4 5 $ 
13 3 : 
14 1 4 Н 
15 3 8 
16 1 8 
1] $ 12 
18 2 9 
19 1 
20 1 
Correlation        .40               .30               .58




what these correlations mean in practical terms, let us consider two 
levels of cutting score. Suppose that we are setting cutting scores 
to eliminate (1) about one-third of the unsatisfactory cases and (2) 
about two-thirds of the unsatisfactory cases. Considering each test,
what would be the cost in loss of individuals who would become satis- 
factory graduates? The results are summarized below. 


                     Low Cut-Off                      High Cut-Off
               Min.    Failures   Successes     Min.    Failures   Successes
               Score   Elim.      Lost          Score   Elim.      Lost
Mathematics      9     33.3%      14.5%          14     66.7%      37.7%
Shop             5     36.7       21.4            7     73.3       56.5
Electricity      7     30.0        8.7            9     76.7       24.7


This little table shows the relationship between the validity co-
efficient for the test and its practical effectiveness. The difference
in the three selection tests shows up most clearly at the higher cut-
off. At this point, using the electricity test, we could screen out
76.7 per cent of the unsuccessful group at a cost of only 24.7 per cent
of our future successes. By contrast, the shop test screens out 73.3
per cent at a cost of 56.5 per cent. The greater efficiency of the
electricity test is clearly evident, and if we could use only one test
this is the one that we should keep.
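
The bookkeeping behind such a summary is simple enough to sketch in Python. The function below takes, for one test, the number of unsatisfactory and of satisfactory men at each score level and returns the percentage of each group that falls below a proposed minimum score. The function name and the score distributions are hypothetical; they are not the data of Table 19.2.

```python
def cutoff_effect(unsatisfactory, satisfactory, min_score):
    """Each mapping gives test score -> number of men at that score.

    Returns (per cent of unsatisfactory men screened out,
             per cent of satisfactory men lost)
    when everyone scoring below min_score is rejected.
    """
    def pct_below(counts):
        total = sum(counts.values())
        below = sum(n for score, n in counts.items() if min_score > score)
        return 100.0 * below / total

    return pct_below(unsatisfactory), pct_below(satisfactory)

# Hypothetical score distributions for a single short test.
unsat = {3: 4, 4: 6, 5: 5, 6: 3, 7: 2}
sat = {3: 1, 4: 3, 5: 8, 6: 14, 7: 14}

for cut in (4, 6):
    failures_out, successes_lost = cutoff_effect(unsat, sat, cut)
    print(cut, round(failures_out, 1), round(successes_lost, 1))
```

Running the function over every possible minimum score shows directly what each additional failure screened out costs in successes lost.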

Combining Tests. When we have used several tests as predictors,
a question that we often face is whether we should be content with
the one best test or whether we should use more than one. If the
decision is to use more than one, we must then decide how many to
use and which ones. A full exploration of these problems leads into
complexities which we cannot consider here and, indeed, brings us
face to face with some unsolved statistical problems. However, we
can open up some of the main approaches to the problem.

In our illustration, the problem we face is whether to use only
the electricity test or whether to give some weight to the mathe-
matics and shop tests. The extent to which the math and shop
tests will be useful will depend upon the extent to which they are
measuring abilities different from those measured by the electricity
test. If they are measuring essentially the same abilities as those
tapped by the electricity test but not measuring them as effectively,
there is no point in adding them on to our battery. However, if
they are measuring different aspects of our criterion, then pooling
the several measures should give us more complete coverage of




the essential abilities and, consequently, better prediction of the
criterion.

To determine whether the predictor tests are measuring the same
or different abilities, we must look at the correlations between them.
These are:

Mathematics versus Shop          -0.02
Mathematics versus Electricity    0.37
Shop versus Electricity           0.30

Thus, the electricity test shows some overlap with each of the other
tests but not a very great overlap. There is most in common between
the mathematics and electricity tests. The mathematics and shop
tests are almost completely unrelated.

Is the overlap of electricity with each of the other two tests so
great that they can add nothing of value to our prediction? To
answer this question, we may compute a statistic known as the
partial correlation. The partial correlation is a measure of the rela-
tionship which remains after the effect of one or more other factors
is removed.* In this instance, it is the correlation of academic grades
with math or with shop after the common influence of the electricity
test score is removed from the picture. These partial correlations
are:
Math versus grades, electricity score partialed out    0.25
Shop versus grades, electricity score partialed out    0.16

Thus, we see that each of the other tests has some validity independ-
ent of the part held in common with the electricity test, though
eliminating the part which they hold in common with the electricity
test has reduced the validity of each.
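
These two values can be checked from the correlations already given. The Python sketch below applies the standard formula for a first-order partial correlation to the rounded coefficients quoted in the text; the function and variable names are ours. The results agree with 0.25 and 0.16 to within the rounding of the input correlations.

```python
import math

def partial_r(r12, r13, r23):
    """Correlation of variables 1 and 2 with variable 3 partialed out."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# 1 = grades, 2 = math or shop, 3 = electricity (correlations from the text).
math_partial = partial_r(r12=0.40, r13=0.58, r23=0.37)
shop_partial = partial_r(r12=0.30, r13=0.58, r23=0.30)
print(round(math_partial, 3), round(shop_partial, 3))
```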

* The formula for partial correlation is

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$

where $r_{12.3}$ is the partial correlation of variables 1 and 2, with the
effect of variable 3 removed, and $r_{12}$, $r_{13}$, $r_{23}$ are the
correlations of variables 1 and 2, 1 and 3, and 2 and 3, respectively.

The Validity of a Composite. We must ask now how much we
could gain in validity by using two tests or all three, combining
them in the most advantageous way. For this we can compute the
multiple correlation. The multiple correlation is the maximum pre-
diction that can be obtained from an additive combination of scores




on two or more tests.* In our example, the multiple correlations for 
combinations of two and three tests are as follows: 


Electricity and Mathematics 0.615 
Electricity and Shop 0.596 
Mathematics and Shop 0.500 
All three tests 0.634 


Thus, we see that the combined tests give a somewhat higher cor- 
relation (0.634) with the criterion than does the best single test 
(0.58) if the tests are combined with the most appropriate weights. 
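
For any pair of tests the multiple correlation can be worked out from three numbers: the validity of each test and the correlation between the two. The Python sketch below uses a common algebraic form of the two-predictor formula and the rounded correlations quoted in the text; the names are ours. Its results differ from the tabled values in the third decimal place, presumably because the tabled values were computed from unrounded correlations.

```python
import math

def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2: validity of each predictor; r12: correlation between them.
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

validity = {"math": 0.40, "shop": 0.30, "elec": 0.58}
between = {("elec", "math"): 0.37,
           ("elec", "shop"): 0.30,
           ("math", "shop"): -0.02}

for (a, b), r_ab in between.items():
    print(a, b, round(multiple_r(validity[a], validity[b], r_ab), 3))
```

The formula makes the earlier point concrete: for fixed validities, the lower the correlation between two tests, the more the second test adds to the first.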

Weighting Tests. Our next problem is to determine the best set 
of weights. These are known as regression weights. They are best 
in the sense that they reduce to a minimum the errors in predicting 


the criterion score.† For our set of three tests the regression weights
are, respectively, 


Mathematics 0.24 
Shop 0.17 
Electricity 0.44 


These are the weights we should use if all our tests had the same 
standard deviation (for example, if they were all in standard score 
units). However, when the tests are in raw-score units, we must 
take account of the standard deviation. A test with a large standard 
deviation already receives a heavy weight just from the variability 
of its scores. The relative weights to be applied to raw scores are
the regression weights divided by the corresponding standard devia-
tions. For our example, we have the following:


               Standard     Regression   Raw-Score   Integral
               Deviation    Weights      Weights     Weights
Mathematics      4.40         0.24         0.055        2
Shop             2.06         0.17         0.083        3
Electricity      2.76         0.44         0.159        6
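
For readers who wish to verify the weights, the sketch below solves the three simultaneous equations that define the regression weights, using only the rounded correlations and standard deviations given in the text. It is written in plain Python; the small equation solver and all names are ours and are offered as an illustration, not as the computing routine used for the original analysis.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

# Order: mathematics, shop, electricity (values quoted in the text).
intercorrelations = [[1.00, -0.02, 0.37],
                     [-0.02, 1.00, 0.30],
                     [0.37, 0.30, 1.00]]
validities = [0.40, 0.30, 0.58]
std_devs = [4.40, 2.06, 2.76]

betas = solve(intercorrelations, validities)            # standard-score weights
multiple_r = math.sqrt(sum(b * v for b, v in zip(betas, validities)))
raw_weights = [b / s for b, s in zip(betas, std_devs)]  # raw-score weights

print([round(b, 2) for b in betas], round(multiple_r, 3))
print([round(w, 3) for w in raw_weights])
```

Small differences in the third decimal place from the tabled raw-score weights are to be expected, since the table divides the already rounded regression weights by the standard deviations.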


* For two predictors the multiple correlation is given by the formula

$$R_{1.23} = \sqrt{1 - (1 - r_{12}^2)(1 - r_{13.2}^2)}$$

where $R_{1.23}$ indicates the multiple correlation of 1 with 2 and 3,
and $r_{13.2}$ is the partial correlation of 1 and 3 with 2 held
constant. The formula for more complex cases will be found in standard
statistics textbooks.

† With two predictor variables, the regression weights to be applied to
standard scores are given by the formula

$$\beta_{12.3} = \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2}, \qquad \beta_{13.2} = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2}$$

The formulas and computing procedures for three or more predictors
will be found in standard statistics textbooks.




The weights in the final column of the above table are simple
integers that stand in very nearly the same ratios as the raw-score
weights. They are more convenient to use than the decimal weights
and are as good for all practical purposes. If we wish to combine
the separate tests, we can use these integers as weighting factors to
be applied to the scores on the three tests. Thus, we could take 2
times the mathematics score plus 3 times the shop score plus 6 times
the electricity score and prepare a composite score for each student.
We have calculated composite scores using the above multiplying
factors. These composite scores were correlated with the criterion
and in this instance the result checked perfectly with the predicted
correlation of 0.634. To see what, if anything, we really gained by
the pooling, we may prepare another table like Table 19.2. We
Table 19.3. Number Passing and Failing Electronics Training at Each
Composite Score Level
(2 Math + 3 Shop + 6 Electricity)

Score        Unsatisfactory    Satisfactory

50-59
60-69
70-79
80-89
90-99
100-109
110-119
120-129
130-139
140-149
150-159

[Frequency columns illegible in source.]


have done this in Table 19.3. Repeating our calculations of cost and
gain from two levels of cutting score, we find that we can

Eliminate 23.3% of failures at a cost of 4.3% of successes, or
eliminate 70.0% of failures at a cost of 20.3% of successes.

Comparing this accomplishment with that for the electricity test
alone, it is hard to see any difference between them. The small
difference in correlation does not show up as any clear improvement
in practical effectiveness in a sample of this small size. With a large
sample, some improvement would presumably be noted.
nt would pre 




PROBLEMS IN THE USE OF SELECTION TESTS 
TWO WAYS OF USING TWO OR MORE PREDICTORS


In the last section, we showed how predictor tests could be used 
two or more at a time by multiplying each test score by an appropri- 
ate weight and adding them together to give a single composite score. 
This we shall call the method of linear combination, since it is based 
on a simple linear equation of the type ax + by + cz. If we used
this method mechanically, we would employ or accept for further 
education those individuals with the highest scores, going down the 
line until we had enough to meet our quota. 

Another way of proceeding would be to set separate qualifying 
scores for our separate measures and accept only those individuals 
who qualified on each hurdle. Thus, we might specify that each 
applicant must get at least the following scores: 


Mathematics    [illegible]
Shop           [illegible]
Electricity    [illegible]


This procedure would screen out 10, or 33.3 per cent, of the failures
at a cost of 8, or 11.6 per cent, of the successes. Higher minimum
scores of

Mathematics    8
Shop           4
Electricity    8

would eliminate 23, or 76.7 per cent, of the failures at the cost of 16,
or 23.2 per cent, of the successes. In this illustration, the separate
cutting scores represent little or no improvement over the electric-
ity test alone. The procedure appears about as effective as using
a single composite score.
In terms of statistical theory, the use of Separate cutting scores 
usually seems less sound than the procedure of linear combination. 
The one exception to this is when some minimum level of a particular
trait is absolutely essential for a job, but additional amounts
are not of great importance, Furthermore, the cutting scores must 
be determined by an essentially trial-and-error process, and once 
they are set they are rather inflexible. However, application of the 
separate minimum requirements is probably simpler for an untrained 
person than is computing a combined score. 
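
The two procedures can be set side by side in a short Python sketch. It applies the integral weights (2, 3, 6) and the higher set of minimum scores (8, 4, 8) from the text to five made-up applicants; the applicants, their scores, the quota of three, and the function names are all hypothetical.

```python
# Hypothetical applicants: (name, mathematics, shop, electricity) raw scores.
applicants = [
    ("A", 15, 6, 11),
    ("B", 7, 9, 12),
    ("C", 12, 3, 10),
    ("D", 9, 5, 8),
    ("E", 16, 7, 6),
]

WEIGHTS = (2, 3, 6)     # integral weights from the text
MINIMUMS = (8, 4, 8)    # the higher set of minimum scores from the text

def composite(scores):
    """Weighted sum of the three test scores (linear combination)."""
    return sum(w * s for w, s in zip(WEIGHTS, scores))

def passes_every_hurdle(scores):
    """True only if each score reaches its separate minimum."""
    return all(s >= m for s, m in zip(scores, MINIMUMS))

def select_by_composite(people, quota):
    """Accept the highest composite scores until the quota is filled."""
    ranked = sorted(people, key=lambda p: composite(p[1:]), reverse=True)
    return [p[0] for p in ranked[:quota]]

def select_by_hurdles(people):
    """Accept everyone who qualifies on every separate test."""
    return [p[0] for p in people if passes_every_hurdle(p[1:])]

print(select_by_composite(applicants, quota=3))
print(select_by_hurdles(applicants))
```

In the composite, a high score on one test can make up for a low score on another, so applicants B and C are accepted in spite of one weak score each. Under separate cutting scores no such compensation is possible, and applicant D, who is last on the composite, qualifies because he clears every hurdle.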
The real practical advantage of separate cutting scores comes
when one of the predictor tests is expensive or time-consuming to
apply. Then, a simpler and more economical test may be applied




to the total group and part of the group may be screened out by
this economical procedure. The more costly appraisal device need
be applied only to the remainder. Thus, if a written test of subject-
matter knowledge and a performance test of actual classroom teach-
ing were being used in the selection of secondary-school teachers, it
would be very reasonable to use the written test as an initial screen-
ing device and to use the performance test only with those who met
minimum standards on the written test.


SELECTION RATIO AND TEST EFFECTIVENESS 

The minimum scores that are set for the separate tests or for the 
composite score in any practical testing program will depend to a 
considerable extent upon the law of supply and demand. When 
applicants are few and vacancies many, lower requirements must be 
set; when applicants are many and vacancies few, the selecting
agency can afford to be particular. This ratio of acceptance to ap- 
plication is called the selection ratio. 

The practical value of a testing program depends as much upon 
the selection ratio as it does upon the validity of available tests. 
In the extreme case in which we can afford to reject nobody, even 
the most valid test is of no value as a selection device. At the other 
extreme, if we need take only one applicant in ten, for example, the 
use of even a test with rather low validity will be quite beneficial. 
Thus, in our example if we took only 10 out of our 99 applicants,
even the shop test gave us men who succeeded in 90 per cent of the
cases. Under these same conditions, the electricity test gave 100
per cent successes. This compares with 70 per cent of successes in
the total group.

With a low selection ratio (few applicants accepted), a testing
program gives promise of large practical gains. When the selection
ratio is high (most applicants accepted), the practical gains will in-
evitably be much smaller.


PRESELECTION AND TEST VALIDITY 
There is one factor whose influence on test validities we must
mention here. The factor is a pervasive one that distorts the inter-
pretation of tests in a good deal of personnel research. Again, the
effect is a complex one that we cannot explore fully. The factor
with which we are now concerned is that of preselection of the group
on which we get criterion data.

In our illustrative example, suppose that a regulation had been in
effect that no one with a mathematics test score below 10 would be




admitted to the training program. This would have eliminated 26 
of our 99 cases and 15 of our 30 failers. The spread of mathematics 
test scores within the group that remained would be substantially 
less, and there would also be less spread in grades. Under these 
circumstances, the correlation of test with criterion will normally 
be reduced. In this instance it drops from 0.40 to 0.15. Thus, if 
we had had a selection program in effect and had admitted only those
with high math test scores, within the admitted group the math 
test would have appeared almost useless. Its true value would not 
have been changed, but the evidence available to us would not have 
permitted us to see that value. Those who would have failed because 
of math deficiency would have been cut off at the source. 

We cannot indicate in this discussion the mathematical procedures 
to correct for preselection. In practice, these cannot always be ap- 
plied in any case, because we do not fully know the nature and ex- 
tent of preselection that has been operating. We can merely give a 
few general guiding principles. 


1. When those admitted to a job or training program have been
selected on the basis of score on some test, the apparent validity of
that test for those remaining will be reduced.

2. The smaller the selection ratio (i.e., the higher the cutting score),
the greater will be the reduction in apparent validity.

3. Selection will also operate to reduce the validity of other tests
that are correlated with the test used for screening.

4. In the correlated tests, the reduction will be in proportion to
the correlation of the second test with the test used for screening.
It will also be in proportion to the degree of selectivity.
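The first two principles can be checked with a short simulation. The sketch below assumes a bivariate normal model with a true validity of 0.40 and screens applicants from the top of the test-score distribution; it does not reproduce the 99-case example of the text, and the function names are hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def validity_after_screening(true_validity, selection_ratio,
                             n_applicants=100000, seed=0):
    """Correlation of test with criterion among those who pass the screen.

    Assumption: test and criterion are standard normal in the unscreened
    group and correlate `true_validity` there.
    """
    rng = random.Random(seed)
    residual_sd = (1.0 - true_validity ** 2) ** 0.5
    pairs = []
    for _ in range(n_applicants):
        test = rng.gauss(0.0, 1.0)
        criterion = true_validity * test + residual_sd * rng.gauss(0.0, 1.0)
        pairs.append((test, criterion))
    pairs.sort(reverse=True)  # highest test scores first
    kept = pairs[:max(2, round(selection_ratio * n_applicants))]
    return pearson_r([t for t, _ in kept], [c for _, c in kept])


if __name__ == "__main__":
    for ratio in (1.0, 0.5, 0.1):
        r = validity_after_screening(0.40, ratio)
        print(f"selection ratio {ratio:.1f}: apparent validity {r:.2f}")
```

The test itself is unchanged across the three runs; only the group on which the correlation is computed has been narrowed.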


If a school, military training program, or employer is installing a
program of selection tests and must choose the best tests to use on
the basis of validity statistics, it is important to try to appraise and 
allow for the effects of preselection in interpreting these statistics. 
A thorough discussion of both the logic and statistics of the problem 
has been supplied by Gulliksen. For the present, it will suffice to 
point out the problem and warn the reader of its importance. 


RATIONAL VERSUS EMPIRICAL BASES FOR WEIGHTING TESTS 
In preceding sections we have outlined procedures for deciding
which tests to use and how much weight to give to each, basing the
decision entirely on the empirical evidence from trying out the tests.
For several reasons, however, it may not be desirable to be guided
entirely by the validity data. The criterion measure will usually be
imperfect, i.e., incomplete or biased in some respects. The sample
of cases may be rather meager. The empirical results may be dis- 
torted by the preselection effects discussed in the previous section. 
For these reasons, we may want to give some weight to our rational 
analysis of the situation as well as to the empirical evidence. Thus, 
in discussing selection tests for medical schools, Stalnaker says:

While I should be unwilling to discourage anyone from correlating any
two variables, I am neither impressed nor concerned when a low corre-
lation is found between scores on a test in understanding modern society
and grades in laboratory work in gross anatomy. I continue to favor
selecting the men for the study of medicine who have some awareness of
social sciences.

If the average grades in first-year medical school do not correlate highly
with a score which may crudely be representative of intelligence, I shall
not conclude that a stupid M.D. is as good as a bright one as far as diag-
nosis of disease in my personal family is concerned.

In this example, limitations of the criterion are recognized. The
writer is expressing his belief that rational analysis is as important
in defining sound selection procedures as are statistical computations.


COMBINING TEST AND NON-TEST DATA 

In any program of personnel selection, consideration will usually
need to be given to factors other than test scores. We have in mind
such things as personal history data, educational or previous work
record, and impressions or evidence gathered in a personal interview.
In practice, this type of material is often used (1) without any sys-
tematic evidence of its validity and (2) in a rather haphazard and
intuitive way. There is no real reason why personal data or work
history items cannot be scored or rated, or why the impressions gained
in an interview cannot be reduced to some quantitative estimate of
probable success in the training program or value on the job. If
this is done, the resulting scores can be evaluated in exactly the same
way that test scores are evaluated. If they prove to have useful
validity, they can be weighted in a composite score together with
tests. That is, qualitative data can be first converted into quanti-
tative terms and then pooled with other types of quantitative data.
This would appear to be a sound extension of the research approach
to personnel selection.
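As a rough illustration of pooling quantified non-test data with test scores, the sketch below puts each predictor on a common standard-score scale and forms a weighted composite. The applicants, the predictor names, and the weights are all hypothetical; in an actual program the weights would be derived from validity data rather than chosen by hand.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to standard scores (mean 0, SD 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite_scores(predictors, weights):
    """Weighted sum of standardized predictors, one value per applicant."""
    z = {name: standardize(values) for name, values in predictors.items()}
    n_applicants = len(next(iter(z.values())))
    return [sum(weights[name] * z[name][i] for name in z)
            for i in range(n_applicants)]


# Hypothetical data for four applicants: one test score plus two non-test
# sources that have already been reduced to numbers (an interview rating
# and a scored work-history form).
predictors = {
    "aptitude_test": [52, 61, 47, 70],
    "interview_rating": [3, 4, 2, 5],
    "work_history_score": [10, 14, 9, 12],
}
# Hypothetical weights, chosen only for illustration.
weights = {"aptitude_test": 0.5, "interview_rating": 0.2,
           "work_history_score": 0.3}

scores = composite_scores(predictors, weights)
best_applicant = max(range(len(scores)), key=scores.__getitem__)
print([round(s, 2) for s in scores], "best applicant index:", best_applicant)
```

Standardizing first keeps a predictor with a wide raw-score range from dominating the composite merely because of its units.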


THE PLACE OF CLINICAL JUDGMENT IN SELECTION PROGRAMS

The procedures we have been proposing so far for the use of tests
and even of qualitative data in personnel selection have emphasized
uniform and essentially rigid procedures for pooling and evaluating
the evidence on each case. This is likely to be offensive to the per-
son who prizes his clinical judgment and would like to temper the
decision in individual cases by that judgment. He is likely to feel
that he can "beat the game" and make predictions that will be more 
accurate than those given by mechanical application of a set of re- 
gression weights. 

Concrete evidence suggests that this is not generally so. Where
adequate empirical evidence is on hand to permit setting standard 
procedures for weighting and combining test results and other evi- 
dence, the mechanical combining usually gives more accurate pre- 
diction of a definite criterion than does an intuitive weighting, and 
persons for whom an exception is made on an intuitive basis do no 
better than their test scores indicated for them. 

An intuitive appraisal of the individual, as by an interview, may
serve a useful function as one of a team of predictors. As suggested 
in the previous section, this appraisal may be quantified as a rating 
on specific points, and the ratings may then be combined with other 
predictors. Again, as we indicated above, when empirical data are
meager, rational and perhaps intuitive judgments may enter into
original decisions as to the weighting of different aspects of evidence.
But intuitive weighting of the evidence for each applicant seems
justifiable only on the grounds of expediency and lack of any sound
empirical evidence as to how different elements of information should
be weighted.


THE OPTIMUM CUTTING SCORE 


In any selection program, we face at some point the problem of 
deciding how selective to be. Shall we accept all but a few of the 
least promising applicants or shall we admit only a small group of
the most promising? There are always practical limitations on how
selective we can be, set by the number of applicants that it is pos-
sible to attract and the number of vacancies to be filled. However,
there is some flexibility in the amount of recruiting done or in the
speed with which vacancies are filled by the first individuals who ap-
pear as applicants.

In general, the more effort and expense we put into recruiting and
testing, the higher we can set our cutting score and the more we can
save in costs of training and efficiency of operation. Sometimes it
may be possible to estimate the per capita cost of increasing the pool
of applicants tested, on the one hand, and the per capita cost of train-
ing a new employee, on the other. Bennett and Doppelt report an
analysis of costs and savings for several different employee-selection 
projects. 

With food-store checkers, for example, the cost of testing an ap-
plicant was figured to be $2.00 and the cost of training a new em-
ployee to be $300. Basing calculations on the per cent of em-
ployees at each score level rated as satisfactory, they calculate the
per capita cost of obtaining a satisfactory employee. Thus, if the
cutting score on the Store Personnel Test is set at 90, their data indi-
cate that it is necessary to train 1.79 employees for each satisfactory
one that is obtained and to test 3.89 in order to get the 1.79 to be
trained. Using these figures and the costs of $300 to train and
$2.00 to test, the cost per satisfactory employee becomes $544.
Similar calculations at each score level yield the data of Fig. 19.1.
Thus, in so far as the costs that have been considered give a true
picture of all costs involved, the most economical procedure here
would be to set quite a high cutting score (about 110) and accept
only those who fell above this score. In this particular instance,
this would mean accepting only the top 10 per cent of applicants.

Fig. 19.1. Curve showing cost of training satisfactory female checkers in a food-store chain.
Vertical axis: cost per successful case. Horizontal axis: total score on store personnel test.
(From Doppelt and Bennett.)

Such calculations as this are only partially realistic because it is
almost impossible to determine either all the costs or all the gains.
Thus, in our illustration we did not show the costs of recruiting a 
large mass of applicants, and this might be substantial. At the same 
time, we did not show any long-term gains from having more effi-
cient employees. The example shows the type of thinking that is
involved in setting a cutting score. It is a balancing of one set of
costs, tangible and intangible, against another, and setting a cutting
score such that the most advantageous balance of costs and gains
will be reached. However, it is rarely possible to reduce the solution
to a precise matter of dollars and cents.
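The arithmetic behind the food-store figure can be written out as a brief sketch. The function and parameter names are hypothetical; the four numbers are those quoted above for a cutting score of 90, and only testing and training costs are counted, as in the text.

```python
def cost_per_satisfactory_employee(tested_per_success, trained_per_success,
                                   cost_to_test, cost_to_train):
    """Testing cost plus training cost incurred for each satisfactory employee."""
    return (tested_per_success * cost_to_test
            + trained_per_success * cost_to_train)


# Cutting score of 90: test 3.89 applicants and train 1.79 of them
# for each satisfactory checker obtained.
cost_at_90 = cost_per_satisfactory_employee(
    tested_per_success=3.89, trained_per_success=1.79,
    cost_to_test=2.00, cost_to_train=300.00)
print(f"${cost_at_90:.2f}")
```

This comes to roughly $544.78, which the text reports as $544. Repeating the calculation with the corresponding figures at each cutting score would give the points of a curve such as Fig. 19.1; recruiting costs and long-term gains would have to be added as further terms if they could be estimated.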


SUMMARY STATEMENT 


The basic pattern of personnel selection and personnel selection
research is simple. Promising predictors are identified. These are
related to a suitable measure of job success. On the basis of the
evidence, the most effective predictors are selected, and procedures
are set up for using the evidence from them, either jointly or in suc-
cession. When future students or employees are to be selected, the
relevant evidence is gathered and combined by standard procedures,
and those with the highest standing or those falling above specified
minimum levels are accepted. However, many specific problems 
arise in connection with (1) discovery of promising predictors, (2) 
identification of a suitable measure of success, (3) the analytical 
procedures for studying predictor-criterion relationships, and (4) 
practical routines for using the evidence from predictors. Some of 
these have been considered in the present chapter. 

Certain general viewpoints may be expressed in closing. 

1. Any type of evidence, test or non-test, quantitative or quali-
tative, may appropriately be used in a selection program.

2. Qualitative material should be translated into quantitative form
by either a scoring or rating system.

3. The validity of any type of evidence should be tested out em-
pirically in terms of its relationship to criteria of job success.

4. The empirical evidence on validity should not always be fol-
lowed slavishly in setting up standard procedures for using and com-
bining test data. Rational judgment should be used to temper
statistical evidence, and more weight should be given to rational
considerations when the empirical evidence is less satisfactory.

5. Once standard procedures have been set up, they should be
used in a standard way. Subjective impressions may enter in as
evidence but should not determine the way in which the evidence
is applied to the individual case.


REFERENCES 


1. Brown, C. W., and E. E. Ghiselli, The relationship between the predic-
tive power of aptitude tests for trainability and for job proficiency, J. appl.
Psychol., 1952, 36, 370-372.

2. Doppelt, J. E., and G. K. Bennett, Reducing the cost of training satis-
factory workers by using tests, Personnel Psychol., 1953, 6, 1-8.

3. Gulliksen, H., Theory of mental tests, New York, John Wiley & Sons,
Inc., 1950.

4. Kelly, E. L., and D. W. Fiske, The prediction of performance in clinical
psychology, Ann Arbor, Mich., University of Michigan Press, 1951.

5. Melton, R. S., A study of the relative accuracy of counsellor judgments
and actuarial predictions, Amer. Psychol., 1954, 9, 429-430 (Abstract).

6. Stalnaker, J., Tests for medicine, in Proceedings of the 1950 Invitational
Conference on Testing Problems, Princeton, N. J., Educational Testing Serv-
ice, 1951.


SUGGESTED ADDITIONAL READING 


Dorcus, Robert M., and Margaret H. Jones, Handbook of employee selection,
New York, McGraw-Hill, 1950.

Lawshe, Charles H., Jr., Principles of personnel testing, New York, McGraw-
Hill, 1948, Chapters 2 and 3.

Thorndike, Robert L., Personnel selection, New York, John Wiley & Sons,
Inc., 1949, Chapters 1, 2, 5-10.


QUESTIONS FOR DISCUSSION 


1. Think of some job you know fairly well. What measures might be used
as a criterion of job success? What are the advantages and limitations of
each?

2. Why is it important to have a large group of cases when studying the
validity of a set of tests that have been proposed for use as predictors?

3. Why can combinations of tests predict better than a single test?

4. What advantages do you see in using two or more tests with separate
cutting scores, rather than combining the predictors by a regression equa-
tion? Under what circumstances would separate cutting scores be most ac-
ceptable?

5. For a given validity coefficient, how does the selection ratio affect the
value of a testing program?

6. Under what circumstances might one decide to deviate from a regression
equation in weighting tests for personnel selection?

7. A test was originally tried out on an unscreened group of job applicants,
and for them it was found to have a validity coefficient of 0.50. Then it was
put into use and used to screen out 50 per cent of applicants. What would
you expect to happen to the validity coefficient in the group who were ac-
cepted? Is the change real, or is it a statistical artifact?

8. What considerations limit the use of results such as those shown in Ta-
ble 19.1 to determine a cutting score?


Chapter 20


Measurement in Diagnosis 
and Therapy 


Among the important users of psychological and educational meas- 
urement devices are the clinical psychologist, the school psycholo- 
gist, the remedial reading specialist, and others who are similarly 
concerned with understanding the client who is having special diffi- 
culties. These individuals work in a clinical setting. The distinc-
tive features of the clinician's work, as we see it, are three-fold.


1. He Is Concerned with the Individual Case. He is not dealing
in statistical averages or general principles. He is not setting up
selection procedures that will work well in the total group. He has
a particular person before him, and for the present his concerns
focus entirely in that one person.

2. He Is Oriented Toward Action. The circumstances require that
he must do something. It is not sufficient that he organize a full
description of a person, however profound and insightful the descrip-
tion may be. He must plan what he is going to do by way of advis-
ing, teaching, or treating that person. In spite of whatever doubts
and uncertainties may still beset him, the end result of his study
must be a plan for action.

3. He Must Synthesize Many Types of Information. Dealing with
the individual case and having a responsibility for action, the clini-
cian must take account of all types of relevant information about
that case. The individual is a complex, integrated, functioning unit.
Any action must be taken with respect to that unit as a totality.
A child's reading disability cannot be separated from his social iso-
lation. An adolescent's vocational interests cannot be separated from
the parental pressures upon him. You cannot hire a man's intelli-
gence without hiring also his overbearing attitude toward subordi-
nates.


These features of the clinician's task bring to the fore a number of
special problems. It is our concern in the present chapter to con-
sider these problems and see what guides can be offered to the
worker who must use measurement data as aids in diagnosing and
planning remedial measures for the individual case.


THE NATURE OF DIAGNOSIS 


The dictionary defines diagnosis in a specific sense as (1) "act or
process of finding out what disease a person or animal has by exami-
nation and careful study of the symptoms" and more broadly as (2)
"careful study of the facts about something to find out its essential
features, faults, etc.," (3) "decision reached after a careful study of
symptoms or facts." The user of psychological tests and educational
measurements is concerned with diagnosis in both the specific and
general sense. He may be concerned with using tests to identify a
special pathology, as when certain tests of conceptual thinking are
used to identify cases with organic brain damage. More often, he
is concerned with understanding the essential factors in some condi-
tion, as when he studies the abstract intelligence and educational
achievement of a child who is a school trouble-maker, believing that
these may be essential factors in determining the child's behavior.

At times diagnosis has seemed to imply little more than naming.
Thus, a patient in a mental hospital is diagnosed as a paranoid
schizophrenic, or a school bully is labeled a psychopathic personal-
ity. But there has been increasing unwillingness to be satisfied
simply with a name. Naming is no more than classifying, and the
classifications with which psychology and education have to deal are
rather unsatisfactory.

The value of a medical diagnosis of “malaria” lies in the great 
deal that is known about malaria and the large amount of essential 
resemblance among situations correctly classified as malaria. The 
classification carries with it the possibility of many useful and sig- 
nificant predictions. The course of future symptoms can be forecast 
and the response to specific drugs and types of remedial action can
be anticipated. The value of any diagnosis lies in the predictions
that it makes possible.

Diagnostic labels in psychology do not have very good forecast-
ing value because the categories cover a range of specimens that
respond to treatments in vastly different ways. Thus, to label a
youngster a "psychopathic personality" is of limited value because
we are not able to make dependable predictions of reaction to treat-
ment for the psychopathic personality in the same way that we can
predict the malarial patient's reaction to quinine.


For this reason, it is important that we think of psychological and 
educational diagnosis less as a process of labeling and more as a 
process of “careful study of the facts about something to find out the 
essential features." The emphasis upon essential features, rather 
than upon labels, should help us to make more useful predictions. 
For a diagnosis justifies itself only through the predictions that can 
be made from it. 

Any useful diagnosis contains within itself an element of progno- 
sis—of forecasting the future. A diagnosis of reading deficiency as 
centering in inability to analyze unfamiliar words carries an implica- 
tion that if word-study skills are taught the deficiency will be over-
come. A diagnosis of a case of classroom unruliness as due to a
school curriculum being too academic for the pupil's abilities implies 
that if the pupil is shifted to a more practically oriented curriculum 
the disciplinary problems will be alleviated. 

Arriving at a diagnosis is, then, a process of studying an individ- 
ual thoroughly so as to identify the factors significant in producing 
his symptoms and the interrelationships of these factors. This un- 
derstanding involves concepts of causation and implies predictions 
as to the effect of remedial measures. 


ROLE OF MEASUREMENT IN DIAGNOSIS 


"Careful study of the facts about something" implies accurate 
description of the something and the essential circumstances sur- 
rounding it. The refinement and improvement of this description is 
where measurement enters in. Tests and measurements are useful 
in diagnosis in proportion as they permit a more accurate statement
of the facts. There are a number of points that it will pay us to
consider.


MEASURES TO BE USED 


The very fact that the clinician is dealing with the distinctive 
problems of a specific person implies that it is necessary for him to 
select the tests to be used in terms of the nature of the specific prob- 
lem. The decision rests upon a judgment of which facts are likely to 
be significant in the light of what is already known about the case. 
Of course, it is never possible to be sure which facts will be important 
for a specific case. That is why clinical case studies tend to be com- 
plete and wide-ranging. The good M.D. does not limit himself to 
the most likely causes of a symptom but makes tests to check upon 
other possibilities. Similarly, in the study of a child who is educa-
tionally retarded or emotionally disturbed, prejudgment as to the
one or two most important factors is inappropriate. The clinician 
needs to gather information of many sorts and to assess various pos- 
sibilities. 

However, accumulated experience has indicated the special value 
of certain items of information in specific types of problems. Testing 
for the specific case is oriented in terms of this general background 
of experience. Thus, an individual intelligence test such as the Binet 
or Wechsler has been found to be important for understanding a wide 
variety of problems, both educational and personal. A clinician 
would want the information provided by such a measure almost 
routinely. Many clinical psychologists feel that a projective test 
such as the Rorschach has similar generality of value in cases of per- 
sonal adjustment. 

Other tests will be indicated by the specific nature of the problem. 
Measures of academic achievement will be appropriate when the 
problem is school-related. Diagnostic tests of specific educational 
skills will be called for if the problem relates to deficiency in an 
academic area. Non-language ability measures will enter in if a 
language handicap is suspected. Tests designed to appraise mental 
deterioration will be indicated for cases in which certain types of 
mental disorders are suspected. Other types of problems call for 
other types of information as providing facts likely to be significant. 

The same criteria of validity and reliability that constitute general 
bases for the evaluation of measurement procedures apply in the 
choices by the clinician. Many instruments that look attractive on 
the surface in terms of the information he would like to have will be 
rejected because they cannot be shown to provide that information 
with sufficient validity or precision. Beyond this, accumulated ex-
perience as to the facts likely to be important in problems of a given
type will guide the choice of tests to be used.


INDEPENDENT VERSUS CORRELATED TESTS 

Usually, in selecting several tests to be used in studying an individ-
ual, we choose the tests so that each one gives us new and inde-
pendent information about him. We do not expect to invest time in
giving the client three or four different intelligence tests because we
realize that each one will give us much the same sort of evidence
about him as the others. The second and third and fourth intelli-
gence test would presumably increase the reliability of our appraisal,
but they would do little to increase its scope. Ordinarily, we need
many types of information in order to build an adequate understand-
ing of a case. If we are to get that information economically and
efficiently, each additional test must give us substantial amounts of
new information. It must duplicate as little as possible the tests
we have already given.

The need for independent tests is most apparent in the problems
of guidance and personnel selection considered in the preceding two
chapters. For most effective guidance, we need information about
the wide range of different abilities and interests that have signifi-
cance for educational and vocational success. In personnel selec-
tion, the validity of a team of tests depends not only on the validity
of each but on the independent information each additional test adds
to the pool. As a general working principle, one objective in pick- 
ing tests for a guidance or selection battery should be to keep cor- 
relations with other tests in the battery as low as possible. 

In the diagnostic use of tests, independence of the separate meas- 
ures is also often sought. In our testing we are trying to get informa- 
tion about many different factors that may have entered into the 
current problem. We are trying to get a full and comprehensive 
description of the individual. This calls for measures of separate 
and independent functions. There are, however, certain specific set-
tings in which this desire for independent information does not hold.
These call for special consideration.

Suppose that we have given a 10-year-old boy a reading compre-
hension test and that he reads at the 7-year-old level. We wish to
decide whether this boy should receive special remedial instruction
in reading. What additional information should we seek?

In this case, the important additional information that we require
is not how the boy will perform on measures ordinarily unrelated to
reading ability. We need to get a score on a test that measures
general intellectual ability or that measures language comprehension
without using the mechanics of reading. These are tests normally
showing a high correlation with reading comprehension. They are
called for at this time because, in order to diagnose the case specifi- 
cally as one of reading disability, we must show that comprehension 
through reading falls substantially below comprehension that does 
not depend on reading. 

Here, it is the existence of a large discrepancy where normally
little or no discrepancy is found that permits us to make a useful
diagnosis. If we know only that a child does not read as well as other
children of his age, we have no basis for deciding whether we are
faced with a child of limited ability or a child with adequate ability
who has for some special reason failed to learn to read. We must
make a differential diagnosis. We must reach a sound decision be-
tween these two alternatives. A diagnosis of reading deficiency de-
pends upon the existence of a marked discrepancy between abilities
that normally go together. If we wish to judge whether it is the
mechanics of reading that interfere with the individual's compre-
hension, we need a comparison measure differing from the reading
test only in that it does not depend upon the mechanics of reading.
In this setting, a test of aural comprehension would seem to provide
the most suitable comparison.

There are other situations in which the use of tests that normally 
show a high correlation is indicated, and in which we seek to get 
tests that are more closely correlated rather than more independent. 
The appraisal of mental deterioration is such a situation. Attempts
to appraise whether a psychotic individual has declined from a pre-
viously higher level of mental ability are based upon the use of re-
lated types of mental tests. An attempt is made to get one test
very resistant to the inroads of mental upset, such as a vocabulary
test, and another one very susceptible to loss. The more the two
tests correlate, the more sensitive the difference between them is to
the effects of deterioration.

Whenever a diagnosis is based on the presence of a difference 
where no difference normally occurs, we will find it necessary to work 
with pairs of correlated tests. Then we will seek to maximize the 
correlation rather than to minimize it. This brings us face to face 
with the problem of the reliability of difference scores. As we saw 
in Chapter 6, the reliability of the difference between two corre-
lated measures is lower than that of the separate measures and may
be quite low if the intercorrelations are high.
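The drop in reliability can be made concrete with the usual formula for the reliability of a difference score. The sketch below assumes that the two tests have equal standard deviations and uncorrelated errors of measurement; the function name and the sample reliabilities (0.85 and 0.90) are chosen only for illustration.

```python
def reliability_of_difference(r11, r22, r12):
    """Reliability of the difference between two scores.

    Assumes equal standard deviations and uncorrelated errors of measurement.
    r11, r22: reliabilities of the two tests; r12: their intercorrelation.
    """
    return ((r11 + r22) / 2.0 - r12) / (1.0 - r12)


if __name__ == "__main__":
    for r12 in (0.00, 0.60, 0.85):
        rel = reliability_of_difference(0.85, 0.90, r12)
        print(f"intercorrelation {r12:.2f}: reliability of difference {rel:.2f}")
```

As the intercorrelation approaches the average reliability of the two tests, the reliability of their difference approaches zero.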


ACCURACY OF DIFFERENTIAL DIAGNOSIS 

At this point, we shall need two related concepts. These are the
standard deviation of a distribution of differences and the standard
error of measurement of the difference. Just as the variability or
spread of a set of scores can be expressed by the standard deviation,
so the variability or spread of a set of differences between two scores
can be expressed by the standard deviation of the distribution of
differences. The formula for this is

$$s_{1-2} = \sqrt{s_1^2 + s_2^2 - 2 r_{12} s_1 s_2} \qquad (1)$$

where $s_{1-2}$ is the standard deviation of the difference.
$s_1$ and $s_2$ are the standard deviations of the two scores.
$r_{12}$ is the correlation of the two scores.

When the two standard deviations are equal this becomes

$$s_{1-2} = s_1 \sqrt{2 - 2 r_{12}} \qquad (2)$$

The standard error of measurement of the difference is given by
the formula

$$s_{\text{meas}(1-2)} = \sqrt{s_{\text{meas}\,1}^2 + s_{\text{meas}\,2}^2} \qquad (3)$$

(This formula assumes that errors of measurement for the two tests
are uncorrelated.)

If the standard deviations for the two score distributions are equal,
this can be written

$$s_{\text{meas}(1-2)} = s_1 \sqrt{2 - r_{11} - r_{22}} \qquad (4)$$

where $r_{11}$ and $r_{22}$ are the reliabilities of the two tests.


A difference is rather complicated to evaluate unless the two tests 
are expressed in terms of comparable norms; i.e., unless they are 
expressed in such a form that a normal group has the same mean and 
standard deviation for each test. Then the average expected dif- 
ference is zero. The unusualness of a difference may be judged by 
reference to formula 2. The likelihood that the difference has arisen 
due to chance errors of measurement can be judged by formula 4. 

Suppose that we have a reading test and an intelligence test, each
having the same mean and a standard deviation of 15. Suppose
further, that the reliabilities are respectively 0.85 and 0.90, and the
intercorrelation is 0.60. Then we have, by formulas 2 and 4,

$$s_{1-2} = 15\sqrt{2 - 1.20} = 13.4$$

$$s_{\text{meas}(1-2)} = 15\sqrt{2 - 0.85 - 0.90} = 7.5$$

Now consider pupil W, who gets a reading test score 10 points
below his intelligence test score. How shall this difference be inter-
preted?

Dividing the 10-point difference by the standard deviation of the
difference (13.4), we get 0.75 and, referring to tables of the normal 
curve, we find that a reading score this far below the intelligence test 
score can be expected in about 23 per cent of the cases in a group. 
Dividing the difference by its standard error of measurement (7.5) 
we get 1.33. Reference to the tables of the normal curve indicates 
that errors of measurement could be expected to yield a reading score 
this far below the intelligence test score in about 9 per cent of cases. 
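The worked example for pupil W can be retraced in a few lines. This is a minimal sketch: the function names are hypothetical, the inputs are the figures given in the text, and the normal-curve table is replaced by the standard library's normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def sd_of_difference(sd, r12):
    """Formula 2: SD of differences when both tests have the same SD."""
    return sd * sqrt(2.0 - 2.0 * r12)


def sem_of_difference(sd, r11, r22):
    """Formula 4: standard error of measurement of the difference."""
    return sd * sqrt(2.0 - r11 - r22)


def share_at_least_this_far_below(difference, spread):
    """One-tailed normal-curve proportion beyond difference / spread."""
    return 1.0 - NormalDist().cdf(difference / spread)


sd_diff = sd_of_difference(15, r12=0.60)
sem_diff = sem_of_difference(15, r11=0.85, r22=0.90)
unusualness = share_at_least_this_far_below(10, sd_diff)
from_error_alone = share_at_least_this_far_below(10, sem_diff)

print(f"SD of difference {sd_diff:.1f}, SE of measurement {sem_diff:.1f}")
print(f"deficit of 10 or more: {unusualness:.0%} of cases")
print(f"deficit of 10 or more from error alone: {from_error_alone:.0%} of cases")
```

The two proportions correspond to the 23 per cent and 9 per cent quoted above.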

These two percentages, 23 and 9, define, first, the unusualness and, 
second, the dependability of the difference. The two values, singly
and in combination, provide the basis for evaluating and interpret-
ing the difference.

Note first that the unusualness of the difference depends upon the
correlation between the two tests. With a correlation of 0.60, if we
carried out a school survey a reading deficit as large as 10 points
could be expected in about 23 cases out of 100. If the correlation
were 0.00, this difference could be expected in as many as 32 cases
in 100, whereas if the correlation were 0.85 the difference could be
expected no more than 12 times in 100. Note that in this last case
9 of the 12 would have to be considered to have resulted from errors
of measurement. This is an illustration of the reduced reliability of
difference scores when dealing with substantially correlated tests (see
Chapter 7, p. 180). In this case large differences are infrequent, so
infrequent that practically all that do occur may be attributed to
chance.

With the correlation of 0.60 that was initially assumed for our
example, 23 cases out of 100 will show a difference as large as the one
that was obtained, and 9 of these must be thought of as due to errors
of measurement. Thus, if we had carried out a testing survey in a
school system, in roughly 40 per cent of those cases where we found
reading deficits of 10 or more points we should think of the differ-
ences as stemming from errors of measurement and expect them to
disappear if we gave a new reading test and intelligence test. Or
expressing it in a different way, in such a survey for any given child
there would be only about 3 chances in 5 that an observed difference
of 10 points corresponded to some degree of real difference between
reading level and intelligence level.
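
The dependence of these figures on the intercorrelation can be tabulated with a minimal sketch. It simply repeats the text's arithmetic (formulas 2 and 4 with equal standard deviations of 15 and reliabilities of 0.85 and 0.90); the helper names are hypothetical, and the "share due to error" is the text's own ratio of the two percentages, not a separate statistical model.

```python
import math

def upper_tail(z):
    """Proportion of the normal curve lying above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def pct_with_deficit(difference, s, r12):
    """Per cent of a group expected to show a deficit this large (formula 2)."""
    sd_diff = s * math.sqrt(2 - 2 * r12)
    return 100 * upper_tail(difference / sd_diff)

def pct_from_error(difference, s, r11, r22):
    """Per cent of cases in which error alone yields such a deficit (formula 4)."""
    se_diff = s * math.sqrt(2 - r11 - r22)
    return 100 * upper_tail(difference / se_diff)

error_pct = pct_from_error(10, 15, 0.85, 0.90)
table = {r12: pct_with_deficit(10, 15, r12) for r12 in (0.00, 0.60, 0.85)}

for r12, pct in table.items():
    print(f"r12 = {r12:.2f}: deficit of 10 or more in about {pct:.0f} per cent")

# With r12 = 0.60, the share of survey "deficits" attributable to error.
share_due_to_error = error_pct / table[0.60]
print(f"Share due to error at r12 = 0.60: about {100 * share_due_to_error:.0f} per cent")
```

The last line reproduces the "roughly 40 per cent" of the text, and its complement gives the "about 3 chances in 5" that a surveyed difference is real.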
The figures of the previous paragraph bring home the tentative-
ness of any diagnosis for an individual based on a survey of a group.
When we have carried out a survey and have, on the basis of the
survey, identified a number of atypical cases, we must expect a sub-
stantial proportion of those cases to evaporate upon retesting. We
have capitalized upon peculiar combinations of errors of measure-
ment. We have picked out the cases with plus errors on one test and
minus errors on the other. Cases so identified will not hold up under
closer scrutiny. Incidentally, some of these cases will provide miracu-
lous "cures" if we apply any type of treatment and then retest
them.
When we are dealing with a single case, selected on independent
grounds, the logic of the situation is quite different. Suppose that
we continue to deal with the same pair of tests, having the same
reliabilities. But suppose now that we are testing a single child,
sent to us because his teacher felt that he was having special diffi-
culties with reading. Again, we find his reading score 10 points below
his intelligence score. How are we to evaluate this difference?

Here the only relevant consideration is the standard error of meas- 
urement of this difference (7.5 points), and the significant evidence 
is that a difference of this size could be expected to arise by error of 
measurement alone 9 times in 100. That is, under these circum- 
stances we should conclude that there are 10 chances in 11 that this 
child has some degree of genuine deficit in reading performance rela-
tive to his intellectual level.

We can differentiate the two situations we have described in this 
way: 


1. In the first case, we were trying to discover hypotheses about
the individual. There we capitalized upon all the differences pro-
duced by chance, and in a large proportion of the cases the differ-
ences that appeared would have to be attributed to chance.

2. In the second case, we were trying to test an hypothesis already
suggested from some other source. In this instance, the probabil-
ity of the specific difference being due to chance was much less.

This distinction between discovering and testing hypotheses is a
very important one so far as the assurance with which we can in-
terpret test score differences is concerned. Our interpretation must
always be tentative, a matter of probability. However, the prob-
abilities of the difference being genuine are much higher when we are
in the situation of testing a specific hypothesis. Furthermore, in
this situation our test results can usually be considered an unbiased
estimate of the true scores. There is no systematic tendency for ob-
served differences to shrink, as there is when we use a survey to pick
out extreme or atypical cases.


THE BEST ESTIMATE OF TRUE SCORE 


So far we have been emphasizing the fallibility of scores for an
individual and the possibility that the differences we obtain may
have arisen due to errors of measurement. However, as we have
indicated in the previous section, when we give a set of tests to one
specific individual and get a set of scores, the values that we ob-
tain can usually be considered unbiased estimates of the true scores.
They represent the best guesses we can make as to what the true
scores for that particular individual are. So, in the illustration we
have been using, if the child's reading score is 10 points below his
intelligence score, this difference of 10 points represents our best esti-
mate of the true state of affairs. The best that we can do is to assume
it is a true difference and plan our actions accordingly. Our practical
actions should ordinarily be based on the assumption that the ob-
tained difference is the correct difference, though any judgment we
make should be tentative and any action considered subject to re-
view in the light of new evidence.


IDENTIFYING THE ESSENTIAL FACTORS 


Suppose, now, that we have completed our testing, interviewing,
and other procedures for gathering evidence to provide as complete
and accurate a description of the individual as it seems desirable or
practical to get. How are we to decide what are the "essential
features, faults, etc."? Here techniques of measurement have noth-
ing to offer us. We must depend on the accumulated background
of experimental research and clinical experience in the field. This
research and experience will have indicated certain factors as re-
curring sources of the type of problem or difficulty we find in our
present case. We will need to consider these as possible causal
factors.

The description of the present case will usually indicate a number 
of factors that may be the essential features of the problem. The 
child with reading deficiency had a poor teacher in the second grade, 
has defective skills of word study and analysis, has a younger sister 
who is favored by the parents, and is rejected by other pupils in the 
class. In the light of accumulated research and experience, we hy- 
pothesize that some one or more of these is the essential (or the 
remediable) feature of the case and take steps to remedy that condi-
tion or situation. The hypothesis is a tentative one. If our efforts
at remediation are successful, we consider that the hypothesis has
been supported. If our efforts are unproductive, we must usually
modify our hypothesis and try other procedures to produce improve-
ment. It is in the outcome of our efforts that we get cues to the ap-
propriateness of our hypothesis and suggestions as to further evi-
dence that may be needed.


SUMMARY STATEMENT 


Measurement contributes to clinical diagnosis by providing
accurate descriptions of factors or conditions that may be important
in producing the present unsatisfactory state of affairs. Procedures
are chosen to cover factors that accumulated research and
clinical experience have shown may be significant in cases of this
sort. The usual criteria of validity and reliability apply.

Ordinarily, different tests are chosen because they are relatively 
independent and will give us different types of information about
the individual. However, when we wish to make a differential di- 
agnosis, we need to use measures that are normally closely corre- 
lated. The differential diagnosis depends upon the presence of dif- 
ference where normally little difference is found. 

The use of highly correlated tests raises doubt as to the dependa- 
bility of the obtained score difference. When the tests are used to 
locate cases for special treatment, identification becomes quite un- 
reliable and judgments must be extremely tentative. When the tests 
are used to verify hypothesized differences, much more confidence 
can be placed in the results. 

In the study of a single isolated case, obtained test scores usually
represent our best estimate of true scores. Though interpretations
should always be tentative, due to the range of error that is possible,
we will be most correct in the long run if we operate on the assump-
tion that the obtained scores are correct.

Determination of which of a set of facts about an individual are
the essential facts cannot be made by measurement techniques. An
hypothesis must be adopted on the basis of the general background
of research and clinical work with cases of the sort being studied, and
this hypothesis must be tested by the outcomes of treating the case.


QUESTIONS FOR DISCUSSION 


1. When is it desirable that the tests used for diagnosis be as independent
as possible? Why? When is it desirable that they be highly correlated?
Why?

2. Why is high reliability especially important in tests used by the clinician?

3. A college student is failing several courses. What information would
you want in order to make a diagnosis of his problem?

4. A sixth-grade pupil can read only at the second-grade level. What
items of information would be important if you were to understand this
pupil?

5. A junior high-school girl is unsociable, moody, and has no friends. What
place would tests have in understanding this girl? What other sorts of in-
formation would you want?

6. What important difference is there between using tests to locate cases
of reading disability and using tests to verify suspected reading disability?


Appendix I


Computation of Square Root 


If facilities are available, it will be easiest to get the value of the square
root by using a slide rule or a set of tables of squares and square roots. (Such
tables will be found in engineering handbooks and books of tables for mathe-
maticians and statisticians.) When no such aids are available, the following
computing routine may be used.


Instructions, each followed by the example

1. Starting at the decimal point, break the number up into blocks of
   two digits each.

          10.27 03

2. Look at the first block on the left. Pick the largest number whose
   square is equal to or less than this number.

          10.          3² = 9

3. Place the number from 2 to the right of the problem, and subtract
   the square from the block on the left.

          10.27 03 | 3
           9
           1

4. Bring down the next block of two digits.

          10.27 03 | 3
           9
           1 27

5. Double the number on the extreme right, and place it to the left of
   the number in 4 above. (Leave a space on the right of the doubled
   value.)

          10.27 03 | 3
           9
     6_    1 27

6. Pick the largest digit that can be put after the entry in 5 above
   that can be used as a multiplier of the 2-digit number and still
   have it less than the number resulting from 4. Put this number
   after the entry in 5 and also after the figure at the extreme right.

          10.27 03 | 3.2
           9
     62    1 27

7. Multiply the entry on the extreme left (62) by the digit on the
   extreme right (2), and put the product below the entry from 4.
   Subtract.

          10.27 03 | 3.2
           9
     62    1 27
           1 24
              3

8. Bring down the next block of two digits, and repeat steps 5, 6,
   and 7. (Since 640 is greater than 303, it goes in zero times.)

          10.27 03 | 3.20
           9
     62    1 27
           1 24
    640       3 03

9. Bring down the next block of two digits (zeros) and repeat steps
   5, 6, and 7. Repeat until the desired number of places is obtained.

          10.27 03 00 | 3.204  Ans.
           9
     62    1 27
           1 24
   6404       3 03 00
              2 56 16
                46 84
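
The computing routine above can also be written as a short program. This is a sketch only: the function name and its arguments are hypothetical, it works on a decimal string so that the two-digit blocks are explicit, and it truncates rather than rounds the last place, as the hand routine does.

```python
def sqrt_by_blocks(number_text, decimals=3):
    """Digit-by-digit square root of a non-negative decimal string.

    Returns the root truncated to `decimals` places (as text) together
    with the final remainder of the hand computation.
    """
    whole, _, frac = number_text.partition(".")
    if len(whole) % 2:                      # pad so blocks start at the decimal point
        whole = "0" + whole
    frac = (frac + "0" * (2 * decimals))[: 2 * decimals]
    digits = whole + frac
    blocks = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]

    root = 0        # digits of the answer found so far
    remainder = 0   # running remainder after each subtraction
    for block in blocks:
        remainder = remainder * 100 + block          # bring down the next block
        trial = 0
        # 20 * root + d is "double the root, then append the digit d"
        while (20 * root + trial + 1) * (trial + 1) <= remainder:
            trial += 1
        remainder -= (20 * root + trial) * trial     # subtract the product
        root = root * 10 + trial

    if decimals == 0:
        return str(root), remainder
    text = str(root).rjust(decimals + 1, "0")
    return text[:-decimals] + "." + text[-decimals:], remainder


answer, left_over = sqrt_by_blocks("10.2703", decimals=3)
print(answer, left_over)
```

For the number used in the example, the program retraces the same trial divisors (6_, 64_, 640_) and ends with the same remainder as the hand computation.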


Appendix II


Calculating the Correlation
Coefficient


The correlation coefficient is an index that expresses the extent to which
two variables (X and Y) go together. It indicates the extent to which high
X scores go with high Y scores, and vice versa. But "high" and "low" must
be expressed in some uniform terms from one set of data to another if the
index is to have the same meaning for different sets of data. The standard
framework for expressing "high" and "low" is the mean and standard devia-
tion of the group. If each X or Y score is expressed as being so many stand-
ard deviations above or below the group mean, the product of these X and
Y standard scores is calculated, and the average of these products is obtained,
the result is the Pearson product-moment correlation coefficient.
This can be expressed by the following formula:

$$r = \frac{\sum z_x z_y}{N}$$

where r is the correlation coefficient.
z_x and z_y are standard scores in X and Y.
N is the number of cases.

This is a definition of the correlation coefficient. Now we must consider
the steps in computing it. Below are outlined the procedures for computing
the correlation coefficient from sets of raw test scores. The procedure is illus-
trated with numerical data from the reading and arithmetic tests shown in
Table 5.1.
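
The definition can be applied directly to ungrouped scores. The sketch below is illustrative only: the function name is hypothetical, the scores are made-up numbers rather than the data of Table 5.1, and the standard deviations are computed with N (not N − 1) in the denominator so that the average of the standard-score products equals r.

```python
import math

def pearson_r(xs, ys):
    """Average product of standard scores for paired X and Y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / n

# Hypothetical scores for four pupils (not from the text).
x_scores = [1, 2, 3, 4]
y_scores = [1, 3, 2, 4]
print(round(pearson_r(x_scores, y_scores), 2))
```

With many cases this direct method is laborious by hand, which is why the grouped-data steps that follow work from a two-way tabulation instead.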


Step 1. Select class intervals for both of the variables.

Example: In our illustration, both arithmetic and reading scores are
grouped by 3's.

Step 2. Prepare a two-dimensional tabulation sheet, indicating class in-
tervals for the X variable on the top and for the Y variable on the
left of the chart. Cross-section paper or special tabulating sheets can
be used with advantage.

Example: The tabulation sheet is shown in Fig. 1. The X variable is the
arithmetic score and the Y variable the reading score.

[Fig. 1. Tabulation sheet with tally marks. Variable X (arithmetic score)
across the top and Variable Y (reading score) down the left, both in class
intervals of 3 points.]


Step 3. Tally the data, entering each score as a tally mark in the cell
corresponding to the X and Y score for that case. Count the number
of tallies in each cell, and write in the frequencies in the upper part of
the cell.

Example: Tally marks have been entered in the tabulation sheet in Fig. 1.
The frequencies are indicated in Fig. 2.

Step 4. Sum down each column and enter the totals on the bottom edge
of the tabulation sheet. Sum across each row and enter on the right.
These totals entered in the margin give the simple frequency distribu-
tion for X and Y, respectively.

Example: Sums are shown in Fig. 2. The entries across the bottom are for
the X (arithmetic) variable and those at the right for the Y (reading)
variable.

Step 5. Consider the values entered at the right of the table in step 4.
They make up a simple frequency distribution of Y scores. Following
Chapter 5 (pp. 95-97), carry out the steps for calculating the stand-
ard deviation. Determine N, Σfy′, and Σf(y′)².

Example: The values for y′, fy′, and f(y′)² are shown in the three columns
just to the right of the column of frequencies in Fig. 2. For this sample,
Σfy′ = −7 and Σf(y′)² = 535. (It may be noted that the Y variable is the
reading test, and that these values are identical with the ones calculated
for that test in Chapter 5.)


[Fig. 2. Two-way frequency table of Variable X (arithmetic score) by
Variable Y (reading score), with the marginal frequencies, the columns for
y′, fy′, and f(y′)² at the right, the corresponding rows for x′ at the
bottom, and the fx′y′ product entered in each cell. Totals legible in the
original: Σfy′ = +61 − 68 = −7, Σf(y′)² = 535, Σfx′y′ = 277.]


Step 6. Repeat step 5 for the frequencies of the X variable entered
at the bottom of the table.

Example: The value of Σfx′ is +36; Σf(x′)² equals 748.

Step 7. Multiply the frequency in each cell of the two-way tabulation
by the x′ and the y′ values for that cell. Enter this product, fx′y′, in
the lower corner of the cell. This step will be easier if the column
and row chosen for the arbitrary origin are enclosed in heavy
rules, to show the zero point for each scale. The frequency in a cell
must be multiplied by both the x′ value for that cell and the y′ value.

Example: These entries have been circled in Fig. 2. Consider the row just
above the heavy horizontal rules. Going right from the heavy vertical
rules, we come to a frequency of 1 in the second cell. For this cell f = 1,
y′ = 1, and x′ = 2, so the product is 1 × 1 × 2 = 2. For the next cell
in the row f = 1, y′ = 1, and x′ = 3, so the product is 1 × 1 × 3 = 3.
Notice that in the upper left and lower right quarters of the table,
the products are negative, because either x′ or y′ is negative. Also no-
tice that all products for cells between the heavy lines are zero.

Step 8. Sum the fx′y′ values for all the cells. This gives Σfx′y′, the
sum of all the products of x′ and y′ values.

Example: In the example, the values have first been summed across each
row, and these sums entered in the column at the far right. This column
has then been summed to give Σfx′y′ = 277.

Step 9. The formula for computing the correlation coefficient is

$$r = \frac{\dfrac{\sum fx'y'}{N} - \left(\dfrac{\sum fx'}{N}\right)\left(\dfrac{\sum fy'}{N}\right)}{\sqrt{\dfrac{\sum f(x')^{2}}{N} - \left(\dfrac{\sum fx'}{N}\right)^{2}}\;\sqrt{\dfrac{\sum f(y')^{2}}{N} - \left(\dfrac{\sum fy'}{N}\right)^{2}}}$$

Substitute the proper values in the formula and solve. (It should be
noted that the two terms in the denominator are merely the formulas
for the standard deviation of X and Y, respectively.)

Example: For our example, the solution becomes:

$$r = \frac{5.42}{\sqrt{10.28}\;\sqrt{14.38 - (0.692)^{2}}} = \frac{5.42}{\sqrt{10.28}\;\sqrt{13.91}} = 0.45$$
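
The final substitution can be checked from the sums alone. In this sketch the function and argument names are hypothetical. Two inputs are assumptions rather than values stated outright in the text: N = 52 is inferred from the printed ratios (36/N = 0.692 and 748/N = 14.38), and Σf(y′)² = 535 is read from the f(y′)² column of Fig. 2.

```python
import math

def grouped_r(n, sum_fx, sum_fy, sum_fx2, sum_fy2, sum_fxy):
    """Correlation from coded (x', y') sums of a two-way frequency table."""
    mean_x = sum_fx / n
    mean_y = sum_fy / n
    covariance = sum_fxy / n - mean_x * mean_y
    sd_x = math.sqrt(sum_fx2 / n - mean_x ** 2)   # SD of X in interval units
    sd_y = math.sqrt(sum_fy2 / n - mean_y ** 2)   # SD of Y in interval units
    return covariance / (sd_x * sd_y)

r = grouped_r(
    n=52,          # assumed: inferred from 36/N = 0.692
    sum_fx=36,     # sum of f x'
    sum_fy=-7,     # sum of f y'
    sum_fx2=748,   # sum of f (x')^2
    sum_fy2=535,   # assumed: read from the Fig. 2 column
    sum_fxy=277,   # sum of f x' y'
)
print(round(r, 2))
```

Because both variables are coded in class-interval units measured from an arbitrary origin, the interval width and the choice of origin cancel out of the ratio and need not be supplied.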


Appendix III
Section A


GENERAL INTELLIGENCE TESTS 


ARMY GENERAL CLASSIFICATION TEST: 1ST CIVILIAN EDITION (AGCT) 
Science Research Associates Testing time: 40-50 min. 
Range: Grades 9 to 16 and adults 

The reliability is reported as 0.92, based on World War II inductees. The 
only validity data reported are the relationships of AGCT scores to occupa-
tional classification. To the inexperienced user these data may be very mis-
leading. Data given on construction and standardization of the test are also 
misleading, since this form was administered to less than one million men 
and was replaced before December, 1941. This form of the AGCT is not the 
AGCT that was administered throughout World War II. It has nothing 
special to recommend it to the civilian user in preference to other intelligence 
tests designed for the same level. 


AMERICAN COUNCIL ON EDUCATION PSYCHOLOGICAL EXAMINATION FOR
COLLEGE FRESHMEN (ACE)
Educational Testing Service Testing time: 40-65 min.
Range: College freshmen

A new form of the test is published each year. Although reliability data
are not available for the new form, users of the test can be assured of high
reliability on the basis of previous forms. Little or no validity data are given in
the manual, but many studies have shown that the total score and L score
have predictive value for certain types of achievement in college. The test
gives three scores: (1) L (language), (2) Q (quantitative), and (3) total. Total
score is the most reliable. The L and Q scores have some diagnostic value.
The Q score is of limited value in predicting academic success. Norms are
extensive and cover a variety of different types of colleges. One of the best
tests for use with college freshmen.


AMERICAN COUNCIL ON EDUCATION PSYCHOLOGICAL EXAMINATION
FOR HIGH SCHOOL STUDENTS
Educational Testing Service Testing time: 35-65 min.
Range: Grades 9 to 12

A new test form is issued each year. Reliability is satisfactory. Material
is similar to that found in the form for use with college freshmen. Norms are
probably not representative of the national high-school population. The test
will probably be most useful in counseling students who intend to continue
their education beyond high school.


CALIFORNIA SHORT FORM TEST OF MENTAL MATURITY 


California Test Bureau Testing time: 40-60 min. 

Range: Pre-Primary (kindergarten to 1); Primary (grades 1 to 3); Elementary 
(grades 4 to 8); Intermediate (grades 7 to 10, adults); Advanced (grades 9 
to 16, adults) 


The reported reliability coefficients for all levels are spuriously high because 
they are based on two or more grades combined. No validity data are re- 
ported in the manual. The test gives three scores: (1) verbal, (2) non-verbal, 
and (3) total I.Q. About one-half of the test involves no reading even at the
higher levels. The elaborate diagnostic profile provided for each test should 
probably be ignored since each subtest has only a few items and is extremely 
unreliable. The test is adequate as an over-all measure of intelligence. 


CALIFORNIA TEST OF MENTAL MATURITY 
California Test Bureau Testing time: 90-110 min. 
Range: Same as for California Short Form Test of Mental Maturity 


Reliability coefficients reported in the manual are probably spuriously high 
because they are based on two or more grades combined. The only evidence 
reported on validity is a correlation of 0.88 with Stanford-Binet, but no infor- 
mation is given about the characteristics of population on which this corre- 
lation is based. The test gives eight scores: (1) memory, (2) spatial relation- 
ships, (3) logical reasoning, (4) numerical reasoning, (5) vocabulary, (6) total 
language factors, (7) total non-language factors, and (8) total mental factors. 
Only the last three scores are probably reliable enough to be used. The diag- 
nostic procedures suggested in the manual should be used with extreme cau- 
tion. The test is adequate as an over-all measure, and the separate language 
and non-language scores have some diagnostic value.


CHICAGO NON-VERBAL EXAMINATION 


Psychological Corporation Testing time: 40 min. 
Range: Age 7 to adults 


The test is designed to measure non-verbal aspects of intelligence. It is
composed of ten subtests which may be administered either with oral direc-
tions or pantomime directions. Reliability coefficients range from 0.80 to
0.93 obtained by both split-half method and retest on groups with ranges of
2 and 3 years in chronological age and 2 to 6 grades in school placement. The
test will probably be useful for children with hearing difficulties or reading
disabilities. It may also be useful with both children and adults who have
limited use of English. Some of the pictorial material is very poorly drawn
and reproduced.


HENMON-NELSON TEST OF MENTAL ABILITY 


Houghton Mifflin Company Testing time: 30-35 min.
Range: Grades 3 to 8; grades 7 to 12; grades 12 to 16


Reliability appears to be satisfactory. The test gives a single total score, 
no part scores. Administration is simple. Tests can be scored quickly as 
they have self-marking answer sheets. The manual is entirely inadequate in 
all respects. The test measures verbal ability only. The general format is 
poor. Each level of the test tries to cover too great a grade range. The ele- 
mentary test works best in grades 4 to 7. 


KUHLMANN-ANDERSON INTELLIGENCE TESTS—SIXTH EDITION 
Personnel Press, Inc. Testing time: 30-45 min. 
Range: Kindergarten; grades 1, 2, 3, 4, 5, 6, 7 to 8, and 9 to 12


A separate booklet is prepared for each grade up to grade 7; the upper two
levels cover a range of grades. Each level is identified by a letter on the test
booklet, not by grade level, so that a lower level can be used in a class without
the students knowing it is a lower level. Reliability is satisfactory, coefficients
ranging from 0.88 to 0.95 computed for a single grade. Validity data as given
in the manual are adequate. The I.Q. obtained is based on the median of the
mental ages on the separate subtests. The test yields a single over-all I.Q.
The test is highly verbal, although there is little actual reading. It has many
separately timed subtests, which makes administration somewhat difficult.
The subtests are not to be used for diagnosis. This is one of the best all-
around group intelligence tests to get an over-all mental age.


LORGE-THORNDIKE INTELLIGENCE TESTS 
Houghton Mifflin Company Testing time: 30-45 min.
Range: Kindergarten to 1, 2 to 3, 4 to 6, 7 to 9, 9 to 13, and adults


The test series is available for all grade levels. For grades 4 and above,
separate verbal and non-verbal tests are available. The test items were se-
lected on basis of extensive item analysis, and the norms were based on a
systematically stratified sample. Special care was taken to provide clear
typography and easy-to-read format. Reliabilities are satisfactory, but there
is as yet little evidence on validity. A workmanlike new test, published in
1954.


OHIO STATE UNIVERSITY PSYCHOLOGICAL TEST, FORM 21
Ohio College Association Testing time: No time limit
Range: Grades 9 to 16, adults


This is a power test designed primarily to predict success in college (corre-
lations of about 0.60 with scholastic performance). Reliability is high. It
emphasizes verbal ability. Norms are based on Ohio high-school students
and freshmen in Ohio colleges. A good scholastic aptitude test at the college
level.


OTIS QUICK-SCORING MENTAL ABILITY TESTS 


World Book Company Testing time: 20-35 min.
Range: Alpha (grades 1.5 to 3); Beta (grades 4 to 9); Gamma (grades 9 to 16)


These have been among the most widely used tests in the public schools.
Reliability is satisfactory. Evidence is presented that the scores have value
in predicting school achievement. The tests are extremely easy to administer
and score. They give a single over-all I.Q. Primarily, they test verbal
ability. The Alpha test requires no reading, but the other two levels do re- 
quire reading. 


PINTNER GENERAL ABILITY TESTS, NON-LANGUAGE SERIES 


World Book Company Testing time: 50-60 min.
Range: Intermediate (grades 4 to 9)


This test requires no reading or use of language. Reliability is satisfactory.
The test supplements the Pintner verbal series in the intermediate range.
Directions are quite elaborate. The test must be hand-scored. It is useful
for children with hearing or reading difficulties and is probably most useful
in grades 6 and 7.


PINTNER GENERAL ABILITY TESTS: VERBAL SERIES 
World Book Company Testing time: 45-55 min.
Range: Primary (grades kindergarten to 2); Elementary (grades 2.5 to 4.5);
Intermediate (grades 4.5 to 9.5); Advanced (grades 9 and above)


Reliability is satisfactory. As evidence on validity, the manual reports 
the relationship between test scores and standardized achievement test scores, 
school grades, and other tests. The manual is very good. Five methods of 
interpreting scores are given. Each subtest is separately timed. The adminis- 
tration of the tests is somewhat more demanding than for other group tests. 


TERMAN-McNEMAR TEST OF MENTAL ABILITY 


World Book Company Testing time: 40-45 min. 
Range: Grades 7 to 12 

Reliability is satisfactory. The only validity data reported is the correla- 
tion between this revision and the earlier Terman Group Intelligence Test. 
The entire test is made up of verbal material involving reading, and it yields 
a single I.Q. A well-standardized and well-constructed measure of verbal
ability. 




Section B 


APTITUDE TEST BATTERIES 


CHICAGO TESTS OF PRIMARY MENTAL ABILITIES 


Science Research Associates Testing time: 240 min. 


Range: Ages 11 to 17


The battery gives six scores: (1) number, (2) verbal meaning, (3) space,
(4) word fluency, (5) reasoning, and (6) memory. The battery is based on
factor analysis (Thurstone's multiple factor theory). However, it is not clear
that the tests are more independent than those in other aptitude batteries.
Reliability data reported in the manual are actually not based on the present
test but on the longer 1941 editions; therefore, the data must be discounted.
Validity data are reported only for the 1941 edition. The test is a speed test
rather than a power test. Norms are based only on Chicago children. Norms
are not differentiated according to sex although there are marked sex differ-
ences on some tests.


DIFFERENTIAL APTITUDE TEST BATTERY 


Psychological Corporation Testing time: 300-330 min. 


Range: Grades 8 to 12 


See text, pp. 251-255 for discussion. This is a practical guidance battery
for high-school use. It has an extremely full and well-organized manual. 
Extensive validation data are presented against educational criteria, but little 
against vocational criteria. Claims for validity are modest and realistic. 


FACTORED APTITUDE SERIES 


Industrial Psychology, Inc., Tucson, Ariz. Testing time: 115 min.
Range: Adults


See text, p. 255. This is a battery designed primarily for industrial use.
The validity of the tests for specific purposes seems to be assumed rather than
demonstrated.


FLANAGAN APTITUDE CLASSIFICATION TESTS (FACT)


Science Research Associates Testing time: 300-330 min.
Range: Grades 12 and above, and adults


See text, p. 255. This is a new and ambitious battery for school and in-
dustrial use. The interpretations suggested for test results run far beyond the
evidence.


GENERAL APTITUDE TEST BATTERY 


U. S. Employment Service Testing time: 120-150 min. for group tests
Range: Grades 12 and above and adults


See text, p. 255 for discussion of this battery. It is available only for use
by State Employment Offices.


GUILFORD-ZIMMERMAN APTITUDE SURVEY 


Sheridan Supply Company Testing time: 140-190 min. 
Range: Grades 9 to 16, adults 


See text, p. 256. This set of tests is based on factorial analysis of abilities.
Evidence of the validity of the tests for specific occupations remains to be
provided.


Section C 


READING TESTS 
CALIFORNIA READING TEST 


California Test Bureau Testing time: 35-55 min. 
Range: Primary (grades 1 to 4.5); Elementary (grades 4 to 6); Intermediate
(grades 7 to 9); Advanced (grades 9 to 14)


This test, which appears also as a subtest of the California Achievement
Tests, gives three scores: (1) reading vocabulary, (2) reading comprehension,
and (3) total. The primary battery may be somewhat difficult for average
first grades and somewhat too easy to measure adequately the reading achieve-
ment of above-average fourth graders. The diagnostic claims for the test in
the manual and the diagnostic profiles should be used with extreme caution,
since the subscores are based on a small number of items. The intermediate
and advanced batteries, especially, should be adequate for class survey pur-
poses.


DURRELL-SULLIVAN READING CAPACITY AND ACHIEVEMENT TESTS


World Book Company Testing time: 45 min.
Range: Primary (grades 2.5 to 4.5); Intermediate (grades 3 to 6)

Gives five scores: (1) word meaning, (2) paragraph meaning, (3) spelling
(optional), (4) written recall (optional), and (5) total. The comprehension
test (word meaning and paragraph meaning) of the reading capacity section
is given orally so that the examiner can obtain a measure of the pupil's capacity
to understand written language. The reading achievement section is read
by the pupil without help from the examiner. The fundamental assumption
of these tests is that serious reading disabilities can be discovered by revealing
discrepancies between the pupil's understanding of spoken language and his
understanding of the printed word. Split-half reliabilities for part scores on 
reading capacity test for single grade groups range from 0.78 to 0.91, being 
somewhat higher for older children. Reliability for the reading achievement 
section ranges from 0.83 to 0.95. Although these reliabilities are somewhat 
low for individual diagnosis, the test would probably be useful to classroom 
teachers. Manual is fairly good. 


GATES ADVANCED PRIMARY READING TESTS 
Bureau of Publications Testing time: 40-50 min. 
Teachers College, Columbia University 
Range: Grades 2.5 to 3


The tests give two scores: (1) word recognition and (2) paragraph reading. 
They are easily administered. The relatively high ceiling makes them es- 
pecially well adapted for use in classes containing a number of superior readers. 
For second-grade pupils with limited reading ability the Gates Primary Read- 
ing Tests are probably more suitable. The test materials appear to be very 
attractive and interesting to pupils in these grades. The paragraph reading 
test measures only one narrow reading skill, the ability to follow directions. 


GATES BASIC READING TESTS 


Bureau of Publications Testing time: 60 min. 


Teachers College, Columbia University 
Range: Grades 3.5 to 8 


The test gives four scores: (1) reading to appreciate general significance,
(2) reading to predict outcome of given events, (3) reading to understand
precise directions, and (4) reading to note details. Scoring is somewhat
laborious as answers are marked on the test blank by underlining the correct
answer. The test is easily administered, and the manual very good. It is
likely to be most useful in the middle of the range of grades for which it is
intended. However, the wide range of easier materials should be effective for
measuring the reading ability of below-average readers in the upper grades.


GATES READING SURVEY FOR GRADES 3 TO 10 


Bureau of Publications Testing time: 60-90 min.
Teachers College, Columbia University
Range: Grades 3 to 10


The test yields four scores: (1) vocabulary, (2) level of comprehension, (3)
speed, and (4) accuracy. Scoring is somewhat laborious since the correct
answers are underlined on the test booklet. Elementary school students
sometimes show signs of frustration during administration because of the
difficult material included for grade 10. Some users of the test have
indicated that the test tends to overestimate a pupil's ability to read and
place him above his true instructional level in reading. The manual is good.
The test is probably most effective for general survey purposes in the middle
of the range of grades for which it is intended: i.e., grades 5, 6, 7, and 8.


GATES PRIMARY READING TESTS 


Bureau of Publications Testing time: 25-30 min. 
Teachers College, Columbia University 
Range: Grades 1 to 2.5 


The test gives three scores: (1) word recognition, (2) sentence reading, and
(3) paragraph reading. It is easy to administer and can be used for general
survey or class diagnostic purposes. The manual gives specific suggestions
for use and interpretation of test results and gives suggestions for improving
the types of reading measured by the test. The test is designed for use in a
very narrow grade range and discriminates very effectively within that range.
The test is primarily a power test.


IOWA SILENT READING TEST: NEW EDITION, REVISED 


World Book Company Testing time: 50-60 min. 
Range: Elementary (grades 4 to 8); Advanced (grades 9 to 13) 


There are six scores on the Elementary Test: (1) rate and comprehension,
(2) directed reading, (3) word meaning, (4) paragraph comprehension, (5)
sentence meaning, and (6) location of information. The Advanced Test gives
seven scores: (1-6) the same as the Elementary Test, and (7) poetry
comprehension. The tests are speeded, and the reliability coefficients that
are reported are spuriously high. The total score on the test gives some
weight to study skills, such as use of an index, that are seldom found in a
reading test. The manual is very good. The standard-score scale for the
advanced tests is continuous with that for the elementary forms so that
comparison from one score level to the other is possible.


METROPOLITAN ACHIEVEMENT TESTS: READING 
World Book Company Testing time: 35-45 min.
Range: Elementary (grades 3 to 4); Intermediate (grades 5 to 7.5); Advanced
(grades 7 to 9.5)


The test gives three scores: (1) reading, (2) vocabulary, and (3) total. It
is a subtest of Metropolitan Achievement Tests that can be purchased
separately. The elementary reading test consists of two parts. Part I contains
11 paragraphs, each of which is followed by 3 multiple-choice questions. Part
II is made up of selections varying in length from a single sentence to 10 or
11 sentences. Each selection in Part II has one or more words missing that the
student must write in, i.e., a completion type question. This section is
somewhat time-consuming to score and allows some subjectivity in the scoring.
The vocabulary test consists of 50 words, each of which is followed by 4
choices. This test primarily measures the pupil's knowledge of synonyms. The
intermediate and advanced reading tests consist entirely of selections with
one or more words missing. The pupil must fill in the blank as in a completion
type question. The manual is good and the reliabilities of the test are
satisfactory.




NELSON-DENNY READING TEST: VOCABULARY AND PARAGRAPH 
Houghton Mifflin Company Testing time: 30-35 min. 
Range: Grades 9 to 16 


The test yields three scores: (1) vocabulary, (2) paragraph comprehension,
and (3) total. It has not been revised since 1938. It is easily scored since it
can be purchased with the Clapp-Young Self-Marking Answer Sheets. Although
designed for use at all grade levels in high school and college, the test
has been most extensively used with high-school seniors and college freshmen.
Time limits are rather brief, especially for high-school students. The manual
emphasizes total score, and grade equivalents are provided only for total score.
When a quick overview of the vocabulary knowledge and comprehension skill
of high-school seniors or college students is desired, this test can be used
efficiently.


READING COMPREHENSION: COOPERATIVE ENGLISH TEST 


Cooperative Test Division Testing Time: 40-45 min. 


Educational Testing Service 

Range: Lower level, C1 (grades 7 to 12); higher level, C2 (grades 11 to 16)


The test yields four scores: (1) vocabulary, (2) speed of comprehension,
(3) level of comprehension, and (4) total. Speed of comprehension is a
measure of rate of reading, but level of comprehension is probably less
affected by speed than many comprehension measures. The comprehension section
is arranged in three repeating scales of equal difficulty. Reliability is
satisfactory. Many studies have shown that the results of these tests
correlate highly with measures of school achievement. The test undertakes to
measure some of the more subtle aspects of reading comprehension such as
understanding of mood and purpose. This is one of the best tests designed for
high-school and college students.


SILENT READING COMPREHENSION: IOWA EVERY-PUPIL
TESTS OF BASIC SKILLS, TEST A


Houghton Mifflin Company Testing time: 60-85 min.
Range: Elementary (grades 3 to 5); Advanced (grades 5 to 9)


The test gives three scores: (1) reading comprehension, (2) vocabulary,
and (3) total. Reading material at both levels is well chosen and includes
content in science and social studies. Norms are based upon Iowa children.
The vocabulary sections are probably somewhat difficult for the earlier grades
of each level. The comprehension questions require real reading ability and
thought. In the vocabulary test the words to be defined are presented in
sentences or phrases. The manual is very good.


STANFORD ACHIEVEMENT TEST: READING 

World Book Company Testing time: 30-40 min. 

Range: Primary (grades 1.9 to 3.5); Elementary (grades 3.0 to 4.9); Intermediate
(grades 5.0 to 6.9); and Advanced (grades 7.0 to 9.9)

Three scores: (1) paragraph meaning, (2) word meaning, and (3) total.
This is a subtest of the Stanford Achievement Test which can be purchased
separately. The paragraph meaning subtest consists of paragraphs with one
or more words missing in each paragraph. The pupil must select the correct
word for each blank from 4 words that are given. Though the test involves
levels of comprehension varying from simple recognition to the making of
inferences from several related sentences, the range of objectives measured is
quite limited. The reliabilities of the test are satisfactory for group survey
purposes. The manual is good, and the normative data are adequate.


TRAXLER HIGH SCHOOL READING TEST 


Public School Publishing Company Testing time: 50-55 min. 
Range: Grades 10 to 12 


This test gives five scores: (1) reading rate, (2) comprehension, (3) main
ideas, (4) total comprehension, and (5) total. Reliabilities are somewhat low
for individual diagnosis using part scores but are adequate for group
measurement. The emphasis is on comprehension.


TRAXLER SILENT READING TEST 


Public School Publishing Company Testing time: 50-55 min. 
Range: Grades 7 to 10 


Six scores are provided: (1) reading rate, (2) story comprehension, (3) word
meaning, (4) paragraph meaning, (5) total comprehension, and (6) total.
The data on norms are inadequate. The data presented in the manual
concerning validity, reliability, and equivalence of forms are insufficient.
The good points are (1) comprehension is emphasized; (2) the use of sentences
or phrases to test word meanings is probably a more meaningful device than use
of words alone; and (3) the time allowed on each part is enough so that speed
is not a factor at any of the grade levels.


Section D 


ELEMENTARY-SCHOOL ACHIEVEMENT BATTERIES 
AMERICAN SCHOOL ACHIEVEMENT TESTS 
Public School Publishing Company 
Range: Primary Battery I, grade 1 (35-50 min.)
Primary Battery II, grades 2 to 3 (85-105 min.)
Intermediate Battery, grades 4 to 6 (127-147 min.)
Advanced Battery, grades 7 to 9 (127-147 min.)


All the batteries test reading and arithmetic. Starting with Primary Battery
II, language usage and spelling are added. The reliabilities of subtests
are rather low (some as low as 0.72). Primary Battery I tests word recognition
and meaning rather than reading. The tests are somewhat speeded.
Normative data are inadequate. This battery is probably somewhat less
adequate for surveying pupils and achievement than some of the other
batteries described below.


CALIFORNIA ACHIEVEMENT TESTS 
California Test Bureau 
Range: Primary Battery, grades 1 to 4.5 (90-110 min.) 
Elementary Battery, grades 4 to 6 (120-135 min.) 
Intermediate Battery, grades 7 to 9 (150-165 min.) 


The tests are essentially survey tests of reading, arithmetic, and language.
The tests in the Primary Battery are excessively difficult. The tests are not
diagnostic and the procedures given in the manual for use of the part scores
should not be followed because the scores are not sufficiently reliable. A
significant feature of the tests is that they offer scores for the same 10-part,
subtotal, and total categories from grade 1 through 14. They attempt to
measure a wide range of objectives but sampling of items is limited.
Inadequate data are provided on the normative population. The primary and
elementary batteries are inadequate for measuring the achievement of poor
learners. The intermediate battery is probably most satisfactory. The tests
appear not to be speeded.


COORDINATED SCALES OF ATTAINMENT 
Educational Test Bureau 
Range: Grades 1 through 8, a separate form for each grade level. (Testing
time from 90 minutes for primary grades to about 256 minutes for grades
4 to 8. Pupils are allowed to work on a section until 90 per cent complete
it.)

For grades 1 to 3 the battery covers reading, arithmetic, and spelling. The 
content areas of language, history, geography, science, and literature are 
added for grades 4 to 8. Manuals for the battery are very good. The tests 
of content areas tend to measure factual information and would probably not 
be deemed suitable for measuring other important objectives in these areas. 
The content in the arithmetic tests has been criticized as being too narrow 
to satisfy current ideas of the desired outcomes in this area. For an
elementary-school testing program that stresses information and skills these
tests would be suitable.


IOWA EVERY-PUPIL TESTS OF BASIC SKILLS 
Houghton Mifflin Company Testing time: Elementary, 196-230 min.
Advanced, 263-325 min.
Range: Elementary (grades 3 to 5); Advanced (grades 5 to 9)


The tests included are Silent Reading Comprehension, Work-Study Skills,
Basic Language Skills, Basic Arithmetic Skills. The tests are of the survey
type designed to measure the pupils' functional mastery of a wide variety of
skills in reading, work and study, language, and arithmetic. The manual
gives considerable aid to the teacher in using the test results effectively.
Outstanding features are the inclusion of the work-study test and the scope of
content covered. The tests are easy to give and score. The elementary battery
is considered better for use in grade 5 than the advanced battery.


METROPOLITAN ACHIEVEMENT TESTS 


World Book Company 

Range: Primary I Battery, grade 1 (45-60 min.) 
Primary II Battery, grade 2 (85-100 min.) 
Elementary Battery, grades 3 to 4 (135-150 min.) 
Intermediate, grades 5 to 7.5 (220-240 min.) 
Advanced, grades 7 to 9.5 (220-240 min.) 


All the batteries measure reading and arithmetic. Spelling is added at
grade 2 and above, and language usage at grade 3 and above. The complete
batteries at the intermediate and advanced levels include tests in the
content areas of geography, history and civics, and science, in addition to
the previously mentioned tests. Partial batteries are available at these
levels that include only the reading, arithmetic, English, and spelling tests.
The test manual is very good. Good norms are provided. There is some tendency
for content to adhere rather closely to old subject matter outlines and to
emphasize factual content rather than understanding and application of
knowledge. The battery does not test work-study skills. Reliabilities of the
tests are high (median value 0.91). The partial batteries of skill subjects
are more suitable than the tests of the content areas.


STANFORD ACHIEVEMENT TESTS 


World Book Company Testing time: 80-215 min.
Range: Primary (grades 1.9 to 3.5); Elementary (grades 3.0 to 4.9);
Intermediate (grades 5 to 6); Advanced (grades 7 to 9)


Reading, spelling, and arithmetic are included in all the batteries. Language
is introduced in the elementary battery and is included in all batteries above
that level. Social studies, science, and study skills are added in the
intermediate and advanced batteries. The battery of tests was extensively
revised in 1953. Reliabilities range from 0.81 to 0.92, based on a single
grade computed by the split-half method, and compare favorably with those for
other batteries. Partial batteries are available at the intermediate and
advanced levels which include only the reading, language, arithmetic, and
spelling tests. Separate tests are also available at these levels in
arithmetic, reading, study skills, science, and social studies. The items are
well constructed. The manual is excellent. One may find the content areas a
little too factual in nature.




Section E 


HIGH-SCHOOL ACHIEVEMENT BATTERIES 
CALIFORNIA ACHIEVEMENT TESTS 


California Test Bureau 
Range: Advanced Battery, grades 9 to 14 (150-165 min.) 


This is a continuation of the battery listed in Section D. The tests are 
probably adequate for a survey of achievement in reading, language, and 
arithmetic for high-school students but are not adequate for superior students 
above grade 12. Grade norms, which are given for this test, are inappropriate 
for the level for which the test is intended. Percentile norms are also given. 


COOPERATIVE GENERAL ACHIEVEMENT TESTS, REVISED SERIES 


Educational Testing Service 

Range: Grades 10 to 12 and college entrants (40-45 min. for each test) 
Test I. Test of general proficiency in field of social studies. 
Test II. Test of general proficiency in field of natural science. 
Test III. Test of general proficiency in field of mathematics. 


These tests attempt to measure, with varying degrees of success, general
proficiency in each of the above fields. Each test has a section on vocabulary
knowledge in its field and a second section that attempts to measure
comprehension and different aspects of critical thinking. Part II of the
social studies test is heavily weighted with the reading comprehension type of
item. Part II of the science test is weighted in the direction of the physical
sciences. Different forms of the test are not exactly comparable in the types
of skills they test. The test probably has greatest value in placement of
students or counseling students on their future course of study.


ESSENTIAL HIGH SCHOOL CONTENT BATTERY 


World Book Company Testing time: 200-225 min.
Range: Grades 9 to 13


Four fields are covered in the battery: mathematics, science, social studies,
and English. The items appear well constructed, the manual is very good,
and the norms are very complete. The reliabilities of separate tests are
somewhat low, ranging from 0.67 for science in grade 10 to 0.92 for
mathematics in grade 11. These low reliabilities make the interpretation of
differences between tests for individuals a somewhat hazardous procedure. This
is probably the best battery available for high schools in general for
measuring relatively immediate objectives of instruction. It is probably more
suitable for measuring achievement in academic programs than in general or
commercial programs.




THE IOWA TESTS OF EDUCATIONAL DEVELOPMENT 


Science Research Associates Testing time: 459-480 min. 
Range: Grades 8.5 to 13.5 


The battery yields 10 scores: understanding of basic social concepts, general
background in the natural sciences, correctness and appropriateness of
expression, ability to do quantitative thinking, ability to interpret reading
materials in the natural sciences, ability to interpret literary materials,
general vocabulary, the subtotal of these 8 tests, and using sources of
information.
The manuals are very good. The test battery is designed to yield objective 
evidence of the degree to which concepts are understood rather than the de- 
gree to which isolated facts and operations are recalled. As a measure of 
certain broad aspects of the pupil's educational development, this battery is 
definitely superior. 


IOWA HIGH SCHOOL CONTENT EXAMINATION, 1943 EDITION 
Bureau of Educational Research and Service Testing time: 75-85 min. 
State University of Iowa 
Range: Grades 12 to 13 


Five scores result from the test: English and literature, mathematics,
science, history and social studies, and total. The English and literature
section samples the student's vocabulary, grammar, and literary acquaintance
but gives little opportunity for the student to demonstrate his comprehension
of literature. The mathematics section stresses algebraic concepts heavily.
Physical concepts receive the greatest emphasis in the science section. The
history and social studies section gives greatest emphasis to American
contributions to modern history. The tests in the battery are probably too
short to sample adequately the student's achievement in these fields. The test
might have value in helping to place new students.


Section F 


INTEREST INVENTORIES 


BRAINARD OCCUPATIONAL PREFERENCE INVENTORY 
Psychological Corporation Testing time: 30 min.
Range: Grades 9 to 12, adults


A list of 140 activities classified into 28 sections and 7 occupational fields
is presented. The respondent marks each activity on a 5-point scale as
follows: +2, like it very much; +1, like it somewhat; 0, indifferent about;
-1, dislike it somewhat; -2, dislike it very much. Scores are obtained for
each of the 7 occupational fields and for each of the 28 sections by adding the
numerical values. Norms are presented separately for men and women. The
numbers in the norm group are small. Interest scores are interpreted on a
5-point scale: very low, low, average, high, and very high. The reported
reliability of the inventory is meaningless, and the true reliability is
probably low. The inventory is probably not as useful in its present form as
the Kuder or Strong. No validity data are reported.
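
The additive scoring just described can be illustrated with a short sketch. It assumes, purely for illustration, that the 140 activities are arranged as 7 fields of 4 sections of 5 activities each (an even split inferred from the totals, not stated in the text), and the function name `score_inventory` is hypothetical.

```python
# Hypothetical layout: 140 activities = 7 fields x 4 sections x 5 activities.
# The even split is an assumption inferred from the totals given in the text.
N_FIELDS = 7
SECTIONS_PER_FIELD = 4
ITEMS_PER_SECTION = 5
VALID_MARKS = {-2, -1, 0, 1, 2}


def score_inventory(marks):
    """Add the marks (+2 to -2) into 28 section scores and 7 field scores."""
    expected = N_FIELDS * SECTIONS_PER_FIELD * ITEMS_PER_SECTION
    if len(marks) != expected:
        raise ValueError(f"expected {expected} marks, got {len(marks)}")
    if any(m not in VALID_MARKS for m in marks):
        raise ValueError("each mark must be one of -2, -1, 0, +1, +2")

    # Each section score is the sum of its 5 consecutive item marks.
    section_scores = [
        sum(marks[i:i + ITEMS_PER_SECTION])
        for i in range(0, expected, ITEMS_PER_SECTION)
    ]
    # Each field score is the sum of its 4 consecutive section scores.
    field_scores = [
        sum(section_scores[j:j + SECTIONS_PER_FIELD])
        for j in range(0, len(section_scores), SECTIONS_PER_FIELD)
    ]
    return section_scores, field_scores
```

Under this layout a field score can run from -40 to +40, so a respondent who marks every activity in one field "+2" obtains 40 for that field. Converting such raw sums to the 5-point interpretive scale would require the publisher's norm tables, which are not reproduced here.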


CLEETON VOCATIONAL INTEREST INVENTORY 
McKnight & McKnight, Bloomington, Ill. Testing time: 40-50 min. 
Range: Grades 9 to 12, adults 
Editions for men and for women 


This is an extended inventory, calling for checking of liking for occupations,
activities, and types of people. Items are grouped into subsections of 70,
each of which yields a score for a group of occupations designated as related.
The inventory yields 9 scores for such occupational groupings as physician
and biological science, specialized selling, business administration, etc.
Validity was originally based on the items' ability to discriminate individuals
in a particular occupational group. The inventory is relatively simple and
quick to score. However, no adequate norms are provided and little evidence is
given on the validity of the score patterns it yields.


KUDER PREFERENCE RECORD—VOCATIONAL 


Science Research Associates Testing time: 30-50 min.
Range: Grade 9 and above


See discussion in text (pp. 377-378). Especially appropriate for use with
high-school groups.


KUDER PREFERENCE RECORD—PERSONAL 
Science Research Associates Testing time: 40-45 min. 
Range: Grades 9 to 16 and adults 


Using the same pattern for items as the Vocational form of the Kuder
Preference Record, this inventory appraises liking for 5 more aspects of life
situations: being active in groups, being in familiar and stable situations,
working with ideas, avoiding conflict, and directing others. The scores are
fairly independent of each other and of those in the Vocational blank. The
value of these scales for guidance purposes is less fully explored than that
of the scales in the Vocational form.


MANSON OCCUPATIONAL INTEREST BLANK FOR WOMEN 


Psychological Corporation (distributor) Testing time: 15-20 min.
Range: Grade 9 and above


This inventory consists of a list of 160 occupations, with respect to which
the examinee expresses liking or dislike. It yields scores for specific
occupations. Scoring keys were based on the items' ability to differentiate the
particular occupational group. It is briefer than the Strong Vocational
Interest Blank for Women, but on other counts the Strong inventory is to be
preferred.


STRONG VOCATIONAL INTEREST BLANK FOR MEN (REVISED) 


Stanford University Press Testing time: 40 min. (approx.) 
Range: 17 years and over 


See discussion in text (pp. 374-377). Particularly suitable for college groups.


STRONG VOCATIONAL INTEREST BLANK FOR WOMEN (REVISED)


Stanford University Press Testing time: 40 min. (approx.)
Range: 17 years and over


This inventory of 400 items is similar to the blank for men (see pp. 374-377). 
It yields scores for 24 occupations, chiefly at the professional level. It was 
developed in the same way as the Vocational Interest Blank for Men, but has 
been less thoroughly studied. The blank is particularly suitable for use with 
college groups. 


STUDY OF VALUES—(REVISED EDITION) 


Allport-Vernon-Lindzey Testing time: 20-30 min. 
Houghton Mifflin Company 
Range: College students, adults 


The Study of Values is supposed to measure the relative dominance of 6
basic interests or motives in personality: theoretical, economic, esthetic,
social, political, and religious. The classification is based upon Spranger's
Types of Men. The reliabilities of each of the subscales range from 0.73 to
0.90 computed by the split-half method and from 0.77 to 0.92 computed by
the retest method after a 1-month interval. The inventory was standardized
on a college population. The test is perhaps better suited for research
purposes than for vocational guidance.
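
For readers who want to see how the two kinds of reliability coefficient quoted above are obtained, the following is a minimal sketch. It assumes an odd-even split of the items and the customary Spearman-Brown step-up to full test length; whether the publishers split the items this way or applied this correction is not stated in the text, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Split-half reliability from one testing.

    item_scores holds one list of item scores per person. Items are split
    into odd- and even-numbered halves (an assumed splitting rule), the two
    half scores are correlated, and the result is stepped up to full test
    length with the Spearman-Brown formula.
    """
    first_half = [sum(person[0::2]) for person in item_scores]
    second_half = [sum(person[1::2]) for person in item_scores]
    r_half = pearson(first_half, second_half)
    return 2 * r_half / (1 + r_half)


def retest_reliability(first_testing, second_testing):
    """Retest reliability: correlate total scores from two administrations."""
    return pearson(first_testing, second_testing)
```

A half-test correlation of 0.50, for example, steps up to 2(0.50)/(1 + 0.50), or about 0.67, for the full-length scale. The retest coefficient needs no such correction, since both administrations use the whole scale.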


THURSTONE INTEREST SCHEDULE 


Psychological Corporation Testing time: 10 min. 
Range: Grades 9 to 16, adults

This very brief inventory is based on choices between 100 pairs of
occupations. High reliability is reported, despite the brevity of the
instrument. It yields scores for 10 vocational fields: physical science,
biological science, computational, business, executive, persuasive, linguistic,
humanitarian, artistic, and musical. The test is based entirely on internal
analysis, with no external evidence of validity. No norms are provided, each
person's raw score profile being interpreted in terms of its high and low
points. A quick and plausible but untested instrument for assessing
occupational interests at the professional level.




Section G 


ADJUSTMENT AND TEMPERAMENT INVENTORIES 
ADJUSTMENT INVENTORY (BELL) 


Stanford University Press Testing time: 25 min. 
Range: Grades 9 to 16, adults 


The student and adult forms of this inventory provide an efficient means
of getting self-appraisal in the areas of health, home, social, emotional, and
vocational (adult form) adjustment. Scores are satisfactorily reliable and
appear to discriminate extreme groups identified by experienced counselors.
This inventory has been found useful to identify cases for more intensive
study and to bring out leads to be explored in a counseling interview.


GUILFORD-ZIMMERMAN TEMPERAMENT SURVEY 
Sheridan Supply Company Testing time: 50 min. 
Range: Grades 9 to 16 and adults 

See text, pp. 383-387. One of the best inventories for describing aspects of 
normal personality. Experience is needed to determine whether the dimen- 
sions are of practical importance for personal or vocational counseling. 


HESTON PERSONAL ADJUSTMENT INVENTORY 
World Book Company Testing time: 40-50 min. 
Range: Grades 9 to 16 and adults 

A relatively new and workmanlike inventory yielding scores for analytical
thinking, sociability, home satisfaction, emotional stability, confidence, and
personal relations. The last three are rather closely related. Could suitably
be used as a rough screening device, or provide leads for further investigation
by interview.


MINNESOTA MULTIPHASIC PERSONALITY INVENTORY 


Psychological Corporation Testing time: 30-90 min. 


Range: Age 16 and over 


For discussion see text, pp. 387-391. This instrument is oriented towards
abnormal rather than normal groups, and is designed to differentiate between
them. There seems to be some doubt that it does this very effectively. It is
rather lengthy to use as a screening test. However, the profile based on the
separate scale scores provides a good deal of material for interpretation by
the sophisticated counselor or clinical psychologist.


MINNESOTA PERSONALITY SCALE 


Psychological Corporation 
Range: Grades 11 to 16 (separate forms for men and women) 


The inventory yields 5 scores designated morale, social adjustment, family 
relations, emotionality, and economic conservatism. The first 4 are fairly 
substantially correlated and provide a picture of aspects of adjustment. The 
inventory is quite reliable and reasonably efficient to use. It can provide a 
rough initial screening of cases for more intensive study or leads to be followed 
up in interview. 


MOONEY PROBLEM CHECK LIST 


Psychological Corporation Testing time: 20-40 min. 
Range: Forms for grades 7 to 9, 9 to 12, 13 to 16, and adults 


These check lists provide a systematic coverage of problems often reported
or judged significant at the different age levels. Though the items are grouped
by areas (health and physical development; courtship, sex, and marriage;
home and family; etc.) and a count can be made of items marked in each
area, emphasis is placed on using the individual responses as leads and
openings for an interview. This instrument does not claim to be a test and the
use proposed for it is the type that is probably most justifiable for a
self-report instrument.


PERSONALITY INVENTORY (BERNREUTER) 
Stanford University Press Testing time: 25 min. 
Range: Grades 9 to 16, adults 

This inventory has been widely used over the years, perhaps more widely
than its characteristics justify. It yields essentially two distinct scores:
one for neurotic tendency (introversion and submissiveness scores are highly
correlated with this) and one for self-sufficiency. Validity of the resulting
scores is questionable, but when administered with good rapport it may be of
some value in identifying cases for further study or in providing leads for an
interview.


THURSTONE TEMPERAMENT SCHEDULE 


Science Research Associates Testing time: 10-20 min.
Range: Grades 9 to 16 and adults


This is a brief and rather unreliable instrument yielding scores for 7 aspects
of temperament: active, vigorous, impulsive, dominant, stable, sociable, and
reflective. Because of the relatively low reliabilities, the inventory is of
doubtful value for counseling individuals, and its use should probably be
restricted to certain types of surveys and research studies.


Appendix IV


Sources for Educational and
Psychological Tests


California Test Bureau 
5916 Hollywood Blvd. 
Los Angeles 28, California 

Primarily achievement and intelligence tests and interest and personality
inventories for elementary and high school. Publishes the California
Achievement Tests (formerly Progressive Achievement Tests). Provides IBM
services and technical advice on research problems. Publishes a series of
Educational Bulletins (furnished free upon request) on the selection and use
of tests and testing programs.


Educational Test Bureau
Educational Publishers Inc.
720 Washington Ave. 5 
Minneapolis, Minnesota 


Publishers of Kuhlmann Tests of Mental Development, Minnesota Pre-school
Scale, Coordinated Scales of Attainment (achievement test for elementary
school), Unit Scales of Attainment, Vineland Social Maturity Scale, Minnesota
Rate of Manipulation and Spatial Relations Test. Books for group guidance
and developmental reading series, three reading books for retarded readers
in elementary grades, and various kinds of school records.


Educational Testing Service 
Cooperative Test Division 
Princeton, New Jersey 


Publishers of the American Council on Education Psychological Examina- 
tions, Cooperative Achievement Tests for high schools and colleges, the Coopera- 
tive Inter-American Tests (English and Spanish Editions), the United States 
Armed Forces Institute Tests of General Educational Development, the College 
Entrance Board Examinations, and Graduate Record Examinations. Sponsors 
two national testing programs, one for high schools and one for colleges. 

Maintains an Evaluation and Advisory Service. Except for unusual requests
the service is free. The service will provide assistance in selecting
tests, setting up test programs, and using test results. Not limited to tests
published by ETS. The company will also supply test-scoring service at
cost.




Psychological Corporation 
522 Fifth Avenue 
New York 18, N. Y. 

Publishers of Differential Aptitude Tests, Wechsler-Bellevue Intelligence Scale,
Wechsler Intelligence Scale for Children, Porteus Mazes, Arthur Point Scale of
Performance Tests, Gesell Developmental Schedule, MMPI, and Mooney Problem
Check List. Distributors of Strong Vocational Interest Blank, Rorschach,
Essential High School Content Battery, Evaluation and Adjustment Series,
Metropolitan Achievement Tests, Stanford Achievement Tests, Gates Reading Tests,
Durrell Analysis of Reading Difficulties, Kuhlmann-Anderson Intelligence Tests,
Henmon-Nelson Tests of Mental Ability, Otis Mental Ability Tests, and Pintner
General Ability Tests.

Wide variety of clinical tests, tests for industrial uses, and group tests of
aptitude and personality. Distributor and publisher of many texts in clinical,
vocational, and industrial psychology.

Composed of five divisions: 


1. Test Division: handles sales of tests, advisory services available for test
users, and statistical and scoring services.

2. Market and Social Research Division.

3. Industrial Division.

4. Clinical Division: provides aptitude testing and practical psychological
counsel on educational, vocational, and personal problems of adults and children;
conducts testing programs and studies of individual pupils and other
psychological work for schools and institutions.

5. Professional Examinations Division.


Publishes a series of Test Service Bulletins which are free. 


Science Research Associates, Inc.
57 West Grand Avenue 
Chicago 10, Illinois 


Publishers of SRA Primary Mental Abilities, Chicago Tests of Primary Mental
Abilities, Army General Classification Test, Ohio State University Psychological
Test, Kuder Preference Records, SRA Youth Inventories, Iowa Every-Pupil
Tests, Iowa Tests of Educational Development, various tests of vocational
aptitudes and skills, SRA Guidance Publications (subscriptions, published
monthly; subscribers receive bonus items and are entitled to the use of
SRA Research Service), Guidance Filmstrips, Occupational Information
Materials, elementary school textbooks, booklets and texts for guidance classes
in high schools, reading improvement materials, and professional guidance
publications.


C. H. Stoelting Co. 
424 North Homan Ave.
Chicago 24, Illinois 


Motor skill and coordination tests (especially those requiring special
equipment), formboard space relations tests, and general intelligence tests for
high school, college, and elementary school (group and individual); clinical
tests of mental deterioration, concept formation, and organic brain damage;
preschool tests of intelligence and reading readiness tests; clerical aptitude
and clerical skill tests; mechanical aptitude tests; Horn, Knauber, and Meier art
tests; musical aptitude tests; reading and achievement tests for elementary
and high schools. Handles Cooperative Tests, Iowa Placement Examinations,
Metropolitan, Stanford, Iowa Every-Pupil Tests of Basic Skills, Strong Vocational
Interest Blank, Bell Adjustment Inventory, Bernreuter Personality Inventory,
Vineland Social Maturity Scale, Rorschach, TAT, CAT, and Symonds
Picture Story Test.


World Book Company 
313 Park Hill Avenue 
Yonkers 5, New York 

Achievement batteries and tests for elementary and high school; best
known are Metropolitan Achievement Battery, Stanford Achievement Test; group
intelligence tests: Otis, Pintner, Terman-McNemar; Metropolitan Readiness
Test and Stevens Reading Readiness Test; large number of reading tests:
Durrell-Sullivan Reading Capacity and Achievement Tests, Iowa Silent Reading
Tests. Very few personality inventories (Heston).
Division of Test Research and Service provides help in selecting and using
tests and in interpreting test results.
Has a series of articles called Test Service Notebook and Test Service Bulletins
which present briefly results secured from the use of tests, developments in
the field, planning testing programs, etc.


Index 


Ability grouping, 236-237 
Ability tests, 21, 202 
definition, 202 
use, 202 
Achievement tests, 21, 203, 269-295,
556-558, 559-560
batteries, 285-291, 556-558, 559-560


definition, 21 
diagnostic, 278-284 
early forms, 5 

product scales, 284-285 


reading, 271-278 
standardized vs. teacher-made, 269- 
271 


survey, 275-278 
uses, 21, 291-295 
Adjustment, 23, 382-394, 563-564 
inventories, 382-394, 563-564
Age norms, 156-159 
American Council on Education Psycho- 
logical Examination for College 
Freshmen (ACE), 547 
American Council on Education Psycho- 
logical Examination for High 
School Students, 547 
American School Achievement Tests, 556 
Anecdotal Records, 324-330 
Annual Review of Psychology, 198 
Answer sheets, 72, 142-143 
IBM, 143 
self-scoring, 142-143 
teacher-made, 72
Aptitudes, 21, 204-205 
Aptitude tests, 202-241, 245-266, 551-552


art, 265-266 
batteries, 250-256, 551-552 




Aptitude tests, clerical, 250 
intelligence, 202-241 
mechanical, 248-249 
music, 263-264
occupational, 246-260 
professional school, 262-263 
validation of, 256-260 
vocational, 246-260 

Arithmetic mean, 90-93 

Army Alpha, 5 

Army General Classification Test, 232-233, 507, 547

Art aptitude tests, 265-266

Arthur Point Scale, 216-217

Attitude, 23, 394-396
definition, 23
questionnaires, 394-396


Bell Adjustment Inventory, 563 

Bennett Stenographic Aptitude Tests, 262

Bernreuter Personality Inventory, 564 

Binet, Alfred, 4, 5 

Biographical Data Blank, 372-373 

Blueprint for test, 32-34 

Brainard Occupational Preference Inven- 
tory, 560-561 

Buswell-John Diagnostic Test for Funda- 
mental Processes in Arithmetic, 
282, 283 

California Achievement Tests, 285, 287, 
288, 289, 290, 557, 559 

California First-Year Mental Scale, 219 

California Pre-School Scale, 219, 229 

California Reading Test, 552 

California Short Form Test of Mental 
Maturity, 548 




California Test Bureau, 565 
California Test of Mental Maturity, 174, 
176, 548 
class record, 174 
individual profile, 176 
Cattell, James McKeen, 5 
Central tendency, 87-88, 90-93 
mean, 90-93 
median, 88 
mode, 87 
Certification of pupils, 27 
Character, definition, 22 
Character Education Inquiry, 300-304 
Chicago Non-Verbal Examination, 548 
Chicago Tests of Primary Mental Abilities, 551
Class interval, 81-85, 89 
real limits, 89 
Cleeton Vocational Interest Inventory, 561 
Clerical aptitude tests, 250 
Clinical psychology trainees, assess- 
ment program for, 309-311 
Combining test scores, 522-525 
Compass Diagnostic Arithmetic Tests, 
282 
Completion items, 57-58 
Cooperative General Achievement Tests,
Revised Series, 559 
Coordinated Scales of Attainment, 285, 
287, 288, 289, 290, 557 
Correction for guessing, 73
Correlation coefficient, 100-104, 543-
546
computation of, 543-546 
Criterion, 116-119, 247, 512-515 
for job selection, 247, 512-515 
qualities desired, 118 
selection of, 117 
Culture Free Intelligence Test, 221-222 
Cumulative records, 440-448 
Cutting scores, 527-528 


Darwin, Charles, 4 
Davis-Eells Games, 223-224 
Diagnosis, 27, 278-284 
clinical, 530-539
differential, 533-539
tests for, 278-284
arithmetic, 282-284
reading, 278-282 




Differential Aptitude Test Battery, 178,
251-254, 506, 551

Durrell-Sullivan Reading Capacity and 
Achievement Tests, 552 


Education Index, 198 
Educational objectives, defining, 29 
Educational Test Bureau, 565 
Educational Testing Service, 565 
ERC Stenographic Aptitude Test, 262 
Errors of measurement, 9, 123-124 
Essay test, 35-39, 42-47
characteristics of, 38
comparison with objective test, 42
improvement of, 44-46
scoring, 37-39
use, 42-43
variations of, 44
Essential High School Content Battery, 559
Evaluation, definition of, 26
Evaluation procedures, 16-17
Examination, oral, 1-2 
Examinations, written, 2 
Expectancy tables, 494 


Experimental psychology, 4 


Factored Aptitude Series, 255, 551

Flanagan Aptitude Classification Tests 
(FACT), 255, 551 

Four-Picture Test, 408 

Frequency distribution, 84 

Frequency polygon, 87 


Galton, Sir Francis, 4

Gates Advanced Primary Reading Tests,
553

Gates Basic Reading Tests, 553 

Gates Primary Reading Tests, 554 

Gates Reading Diagnosis Tests, 282 

Gates Reading Readiness Test, 261 

Gates Reading Survey for Grades 3 to 10,
278, 553

General Aptitude Test Battery (GATB),
255, 506-507, 5 

Goodenough Draw-a-Man Test, 217 

Grade norms, 160-161 

Graves Design Judgment Test, 265

Gray's Oral Reading Passages, 281 

Guidance, educational and vocational,
492-508




Guilford-Zimmerman Aptitude Survey,
256, 552 

Guilford-Zimmerman Temperament Sur- 
vey, 383-387, 563 

Henmon-Nelson Test of Mental Ability, 
549 

Heston Personal Adjustment Inventory, 
563 

Hildreth, Gertrude H., 189, 192, 193

Histogram, 86

Horn Art Aptitude Inventory, 265


Individual differences, 4, 5 
Inference, statistical, 83 
Information about tests and testing,
189-200, 504-507
sources, 189-200
vocational significance of test scores,
504-507
Intelligence, 230-236
group differences, 234-236
job success, 233-234
occupational level, 232-233
school success, 230-231
Intelligence quotients, 170-172, 228-230
computation, 170
equivalency from different tests, 172
Stanford-Binet, 170-172
stability of, 228-230
Intelligence tests, 5-6, 205-230, 547-551
culture free, 221-224
early history, 5-6
group, 208-209, 224-225, 547-551
individual, 209-214, 216-218, 224-226
infant, 218-220
non-language, 215-216
performance, 216-218
pre-school, 218-220
reliability, 226-230
types of items, 205-208
uses, 230-234, 236-240
Interests, 23, 373-383, 495-497, 503-
504, 560-562
and ability, 381-382
definition, 23
inventories, 373-382, 560-562
permanence, 381
reliability, 379-380
validity, 381


Interests, and vocational goals, 495-497, 
503-504 

Interpretation of scores, 100, 144-146,
448-453, 497-500

Iowa Every-Pupil Tests of Basic Skills,
285, 287-289, 290, 557

Iowa High School Content Examination,
1943 Edition, 560

Iowa Silent Reading Test: New Edition,
Revised, 274, 278, 280, 554

Iowa Tests of Educational Development,
290-291, 560

IQ (see Intelligence quotients)

Item analysis, 74-78
Item editing, 70
Item writing, 51-69


Knauber Art Ability Tests, 266 

Kuder Preference Record (Personal), 561
Kuder Preference Record (Vocational),
37 -379, 380, 561
Kuder-Richardson Formula 20, 131 
Kuder-Richardson Reliability Coefficient, 130
Kuhlmann-Anderson Intelligence Tests— 
Sixth Edition, 549 


Lee-Clark Reading Readiness Test, 261 
Letters of recommendation, 335-336 
Lewerenz Tests in Fundamental Abilities
of Visual Art, 266 
Lorge-Thorndike Intelligence Test, Non-
Verbal Series, 549
Intelligence Tests, 549


Make A Picture Story Test (MAPS), 408 
Manson Occupational Interest Blank for 
Women, 561 
Marking and reporting, 456-488 
functions of, 458-474 
technical aspects of, 474-488 
Matching item, 63-66 
improvement of, 63-66 
use of, 63 
May, M. A., and Hartshorne, H., 300-
304
Measurement, 10-11, 12
criticism of, 12
refinement in, 10-11
Mechanical aptitude tests, 248-249


Median, 88-89 
Meier Art Judgment Test, 264-265, 266 
Mental Measurements Yearbooks, 190, 
194-195, 271 
Merrill-Palmer Scale, 220-221 
Metropolitan Achievement Test, 173, 177, 
285, 287, 288, 289, 290, 558 
Metropolitan Achievement Tests: Read- 
ing, 273, 554 
Michigan Speed of Reading Test, 278 
Minnesota Multiphasic Personality In- 
ventory, 387-391, 563 
Minnesota Personality Scale, 564 
Minnesota Pre-school Scale, 220-221 
Minnesota Vocational Test for Clerical 
Workers, 250, 506 
Modal interval, 87 
Mode, 87 
Mooney Problem Check List, 564 
Motivation, 26 
Multiple choice item, 58-63 
difficulty of, 59 
improvement of, 59-63 
use of, 58 
Multiple correlation, 519-520 
Musical aptitude tests, 263-264 


National Achievement Tests, 285, 287,
288, 289, 290
Nelson-Denny Reading Test: Vocabulary
and Paragraph, 555
Normal curve, 97, 98
Normal distribution, 97-99
and standard deviation, 98-99
Norms, 153-186
interchangeability of, 168-170
need for, 153-156
principles guiding use of, 183-185
for group, 183-184
for individual, 184-185
types of, 156-168
age, 156-159
grade, 160-161
percentile, 162-165
standard score, 165-168


Objective tests, 35-42, 50-78 
analysis of results, 74-78 
arrangement of items, 70-71 
characteristics of, 40-41
Objective tests, comparison with essay,
42
instructions for examinee, 71-72
item types, 39-40
reproduction, 70
scoring, 73
Objectives, educational, 9, 27-30, 272-
273
Observational procedures, 19-20, 24,
311-331
advantages, 320-321
anecdotal records, 324-330
improvement of, 312-315
informal, 324-330
limitations, 321-324
studies using, 315-320
systematic, 311-324
Ohio State University Psychological
Test, Form 21, 549
OSS Assessment Program, 305-311
Otis Quick-Scoring Mental Ability Tests,
550


Partial correlation, 519 
Pearson, Karl, 4 
Percentile norms, 162-165 
advantages, 162-163 
definition of, 162 
interpretation, 163 
limitations of, 162-165 
Percentiles, 89-90 
Personality, aspects of, 21, 22-23 
Personality inventories, 382-394, 563- 
564 
evaluation, 392-394 
use, 394 
Personality measurement, 298-416, 563- 
564 
behavioral measures, 298, 300-305 
methods of, 298-300 
nominating techniques, 355-360 
observation, 311-331 
projective techniques, 400-416
ratings, 337-366 
self-report, 371-397, 563-564 
situational tests, 305-311 
sociometric techniques, 355-360 
Personnel selection, 510-528 
Picture Story Test, 408 


Pintner General Ability Tests, 215, 505


Pintner General Ability Tests, non-lan- 
guage series, 215, 550 
verbal series, 215, 550 
Practicality of test, 108, 141-146 
Product scales, 284-285 
Professional school aptitude tests, 262- 
263 
Proficiency tests, 203 
Profiles, 172-182 
examples of, 173-176 
interpretation, 177-182 
Prognostic tests, 260-262 
reading readiness, 260-261 
shorthand, 262 
stenographic, 262 
Projective tests, 17, 400-417 
practicality, 417 
reliability, 415-416 
validity, 411-415 
Psychological Abstracts, 198 
Psychological Bulletin, 198 
Psychological Corporation, 566 
Psychological measurement, history of, 
2 
Psychology, experimental, 3-4 
Psychophysics, 3 


Q, 94 
Quartile, 94 
Quotients, 170-172 


r, 101-102 
Range of scores, 93-94 
Ratings, 337-366
check list, 349
factors affecting accuracy, 340-344
forced-choice, 360-362
graphic, 350-351
improvement of, 347-355, 362-366
man-to-man, 351 
peer group, 355-360 
problems in obtaining, 338-340 
reliability, 346 
validity, 346-347 
Reading Comprehension: Cooperative 
English Test, 555 
Reading Readiness Tests, 260-261 
Reading tests, 552-556 
Regression weights, 520-521 
Reliability, 108, 123-141, 178-182 


Reliability, comparison of different 
methods, 131-132 
and correlation between variables, 
140-141 
difference score, 178-182 
equivalent form, 127-128 
from item statistics, 130-131 
Kuder-Richardson, 130-131
parallel test, 127-128 
Spearman-Brown Prophecy Formula, 
129 
split-half, 128-130 
subdivided test, 128-130 
test-retest, 125 
Reliability coefficient, 125, 132-139 
factors influencing, 135-138 
interpretation, 132-140 
Reporting to parents, 464-470 
Review of Educational Research, 197-198 
Revised Minnesota Occupational Rating 
Scales, 504-506 
Rorschach Test, 401-406 
interpretation, 404-406 
reliability, 415-416 
scoring, 403-404 
validity, 411-415 


Science Research Associates, Inc., 566 
School testing program, 421-440 
college, 438-440
elementary school, 433-436 
functions of, 422-429 
planning of, 429-430 
qualities desired in, 430-433 
secondary school, 436-438 
Seashore Measures of Musical Talent, 
263-264
Self-observation, 18-19 
Semantic Test of Intelligence, 223 
Semi-interquartile range, 94 
Sentence completion tests, 409 
Shipley Personal Inventory, 391-392 
Short answer items, 57-58 
Silent Reading Comprehension: Iowa 
Every Pupil, 555 
Situational test, 20-21, 305-311 
Skewed distribution, 92-93 
Sociometric techniques, 355-360 
Spearman-Brown Prophecy Formula, 
129 




Square root, computation of, 541-542 
Standard deviation, 95-100 
computation, 96 
interpretation, 97-100 
and normal curve, 97-100 
scores, 98, 99, 100 
Standard error of measurement, 124, 
132-134, 136-137, 536 
computation, 132 
interpretation, 132-134 
of revised Stanford-Binet, 136 
of Wechsler Intelligence Scale for 
Children, 137 
Standard score norms, 165-168 
computation of, 166 
definition, 166 
normalized, 167-168 
T-score, 168 
Standardized tests, schedule for evalu- 
ating, 146-149 
Stanford Achievement Tests, 285, 287,
288, 289, 290, 558
Stanford Achievement Test: Reading, 555
Stanford-Binet, 5, 136, 209-213, 214,
219-220, 224, 227-230, 261
Statistical symbols, 105
Stoelting Co., C. H., 566
Store Personnel Test, 527
Strong Vocational Interest Blank, 374-
377, 379, 380-381
for men, 374-377, 562
for women, 562
Study of Values (Revised Edition), 562


Teacher evaluation, 294-295 

Teacher-made tests, 26-78 

Temperament, 23, 383-394, 563-564 
definition, 23 
inventories, 383-384, 563-564 
measurement of, 383-394 

Terman, Lewis, 5 

Terman-McNemar Test of Mental Ability, 550

Test administration, 143-144 

Test blueprint, 32-34 

Test, characteristics of, 16 

Test information, sources, 189-200 
bibliographies of tests, 193-195 
journal test reviews, 195-196, 197-198 
textbooks, 191-192




Test methods, 24 
Testing, history of, 5-6 
Testing program, 246-260, 421-440 
occupational, 246-260 
school, 421-440
Tests, oral, 17-18 
Tests, teacher-made, 26-78 
Thematic Apperception Test (TAT), 406-
408
interpretation, 407-408
scoring, 407-408
Thorndike, E. L., 5 
Thurstone Interest Schedule, 562 
Thurstone Temperament Schedule, 564
Traxler High School Reading Test, 278, 
556 
Traxler Silent Reading Test, 556 
True-false items, 55-57 
True score, 538-539 
T-score, 168 
Turse Shorthand Aptitude Test, 262


Understanding, measurement of, 66-69 
Use of test results, 182-185, 234-240,
291-294, 510-528, 530-539
in diagnosis and therapy, 530-539
in personnel selection, 510-528
in schools, 182-185, 234-240, 291-294


Validity, 108-123, 256-260, 519-525 
of aptitude tests, 256-260 
of composite scores, 519-522 
concept, 108 
concurrent, 110, 115 
congruent, 110, 114-115 
content, 109, 110-112 
definition of, 108-109 
empirical, 110, 114-119 
predictive, 110, 116-119 
and preselection, 523-525 
rational, 109, 110-114 
statistical, 110, 114-119 

Validity coefficient, 120-122 
examples, 120 
interpretation, 119-122 
usefulness, 121-122 

Variability, 92-100, 533-536
of differences between scores, 533-
536
measures of, 92-100




Vineland Social Maturity Scale, 352-353 
Vocational aptitude tests, 246-260 
Vocational testing, 246-260 


Wechsler-Bellevue Intelligence Scales,
209, 213-214, 216, 227 


Wechsler Intelligence Scale for Children, 
213-214, 227 

Wing Standardised Tests of Musical In- 
telligence, 264 

World Book Company, 567 

Wundt, Wilhelm, 3 




